Generalised self-referential file system

ABSTRACT

Embodiments of an unrestricted binary unambiguous file or memory mapped object are disclosed along with descriptions of corresponding reading and writing processes. The file or object may be used to store data of any type. ‘Binary unambiguous’ refers to a quality whereby the binary data stored within the datastore (file or memory map) is always and uniquely identified by a binary type identifier readily discerned from the self same map. Similarly, the term ‘unrestricted’ refers to the capacity of the protocol to accept data of any type, nature, format, structure or context, in a manner that retains the binary unambiguous nature of the invention for each data item. A storage object so created may be easily read by dedicated software, as it is of simple definition and is durable in nature. Its generality removes the need for repeated updates and versions of the underlying protocol.

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to Great Britain Patent Application No. 0802573.6, filed on Feb. 12, 2008, and entitled “A Generalised Self-Referential File System.” Great Britain Patent Application No. 0802573.6 is hereby incorporated herein by reference in its entirety.

FIELD

The disclosed technology relates to methods, systems and computer programme products for storing data of multiple types in a single logical data structure.

BACKGROUND

The storage protocols currently in use in the computer industry fall broadly into two categories: those which are proprietary in nature and not intended to be shared between applications (though specialist conversion programs may exist); and those that are intentionally public and open, and designed to store data in a reasonably generalised format. While the former are clearly restricted in scope, and difficult to interpret without skilled knowledge, even the latter public forms suffer from difficulties of ambiguity. That is to say that their content may not be automatically and unambiguously absorbed into a further destination data store, without human intervention to interpret the nature of the data contained and organise it at the destination store.

While file formats exist in their thousands, and are broadly invented to suit the nature of any underlying application, each of these is designed for a particular purpose, and rarely are the nature and content advertised for dissemination and absorption by third parties. In the same way as above, files are also unable to be absorbed immediately and automatically into a destination store without the skilled intervention of a developer, familiar with both the original data file and the destination repository.

Where such file protocols are designed with a more general intent, such as XML, they can indeed contain data that is useful, and can be absorbed programmatically into a target repository. However, this programmatic absorption can be carried out only after a skilled developer has analysed the data schema involved, and written the absorption program accordingly. For example, once a data schema is known and published, there exist mechanisms in XML to declare the schema to be of a particular type, whose details are held in a DTD (document type definition) or schema. After the schema is examined, an absorption routine can be developed that can verify that subsequent documents satisfy this schema, and can then absorb data as required. It is not possible to absorb such an XML document without prior examination, at least in the first instance, of a particular schema by a human operator.

The applicant's earlier published patent GB 2,368,929 describes a facility for flexible storage of general data in a single file format, and provides a generalised relational expression for expressing relations between data items. However, that facility focuses on a particular format that, while having a minimal overhead, uses a typical and proprietary data format that would in course suffer the same vulnerability to change or error as any other proprietary format.

We have therefore appreciated that it would be desirable to provide a format that goes beyond those readily and currently available; in particular, a format that would make it possible for an application to encapsulate data in a manner that allows its later absorption into a destination data store without human interpretation being necessary, and which therefore supports an automated approach to data merging of anonymously contributed data into a destination data store.

We have also appreciated that despite the success and popularity of the various protocols that dominate data storage, transfer and display in the industry today, being respectively RDBMS (relational database management systems), XML, and HTML, it would also be highly advantageous to provide a data store that is unrestricted in scope, and essentially unrestricted in size also (subject to appropriate clustering routines to manage a plurality of discrete and necessarily fixed capacity storage devices). While it is true that databases and data repositories have been built to large and essentially unlimited scale, these databases retain their restricted schemas which prevent new information being absorbed arbitrarily and without human intervention to modify the schema where necessary.

SUMMARY

In one disclosed embodiment, an unrestricted binary unambiguous file or memory mapped object that may be used to store data of any type is provided. As used here, the term ‘binary unambiguous’ is intended to refer to a quality whereby the binary data stored within the datastore (file or memory map) is always and uniquely identified by a binary type identifier readily discerned from the self same map. Similarly, the term ‘unrestricted’ refers to the capacity of the protocol to accept data of any type, nature, format, structure or context, in a manner that retains the binary unambiguous nature of the invention for each data item.

A storage object so created may then be easily read by dedicated software, as it is of simple definition and is durable in nature, since its generality removes the need for repeated updates and versions of the underlying protocol. A description of example reading and writing software is provided.

The nature of the disclosed technology eliminates the need for external schema documents, reserved words, symbols, and other arcane provisions, invented and required for alternate models of data storage. It is common in the art that data protocols are restricted in many ways, principally by schema (restricting context, relationships, and types), by standard types (with typically limited support for non-standard types) or symbology (commas in a CSV file, < and > in a markup file (XML, HTML)). Any such restriction limits the scope of data that may be contributed to a store, and typically results in requirements to declare versions of the file protocol in such a way that the particular set of special symbols and keywords can be publicised and accommodated by developers skilled in the art, and which precludes an automated generalised routine from manipulating an arbitrary file or data store in any but a trivial and inadequate manner.

The present embodiment eliminates these restrictions, and so provides a novel means of unambiguous and spontaneous contribution of data in an unrestricted and arbitrary manner, sufficient to allow true automated processing of novel data in a way that is impossible to replicate with the common popular standards of SQL, RDBMS, XML, CSV and other storage media.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the disclosed technology will now be described in more detail, by way of example, and with reference to the drawings in which:

FIG. 1 is an illustration showing the logical structure of records stored in a data structure such as a memory map or in a file;

FIG. 2 is a schematic illustration of the structure shown in FIG. 1;

FIG. 3 is an illustration showing in more detail an example file stored according to the preferred data storage protocol;

FIG. 4 illustrates a memory map of a device, on which data according to the example protocol is written;

FIGS. 5 and 6 illustrate a system utilising the example data protocol;

FIG. 7 is an illustration of particular records from the file shown in FIG. 3, as they would be logically stored in a Relational Database;

FIGS. 8 and 9 illustrate the basic processes for reading and writing single records respectively;

FIG. 10 illustrates a basic process for initialising a file;

FIG. 11 is an illustration of an example process for preparing a ‘write’ buffer prior to writing to a file;

FIG. 12 is an illustration of an example process for writing records;

FIG. 13 is an illustration of an alternative example process for writing records;

FIG. 14 is an illustration of an example process for declaring a type;

FIG. 15 is an illustration of an example process for declaring data;

FIG. 16 is an illustration of an example process for extracting record bytes from a file;

FIG. 17 is an illustration of an example process for reading data.

DETAILED DESCRIPTION

The preferred embodiment of the invention comprises a binary mapped data storage protocol, for the explicit storage of data of arbitrary binary type and arbitrary content, which may be implemented in memory such as a disk hard drive file, or even as a stream, though special care needs to be taken for consistency in that case.

In particular, the preferred embodiment provides a desirable quality of a truly durable and open data storage, which is that it should be entirely independent of keywords, magic numbers, prior definitions and design, and limitations in definition and scale, while at the same time retaining its capacity for unambiguous data storage. By supporting entirely novel spontaneous and arbitrary contributions and types, the preferred embodiment eliminates the need for versioning in which new keywords, symbols or mark-ups are added (for example in other systems to expand their scope). The preferred embodiment is therefore a simple, elegant and unique solution to the proliferation of myriad variations of proprietary data storage.

In the following discussion, the reader is requested to bear in mind one possible purpose of the protocol, namely a datastore that can be accurately dissected into its constituent data items in a manner whereby each data item is characterised by a unique binary type identifier, without resorting to keywords or special characters, and in such a manner therefore that an automated algorithm will suffice to accurately write a file compliant with the format, and to read data from such a file or storage device, so eliminating many of the circumstances in which a skilled developer would be required to intervene, if say one of the current popular and alternative protocols were used in its place.

As noted in the introduction, one of the current most popular data protocols is XML, which despite its supposed generality creates in effect an entirely new file protocol every time a novel schema is invented by a user/developer. In effect, this means that file content cannot be accurately processed by a computer program until a user/developer has first examined the novel schema, and thereafter written code appropriate and consistent with the developer's appreciation of the intent and content of the file as defined by the schema, and associated documentation.

Thus, far from being a ‘general’ file protocol, in fact the XML protocol invites a proliferation of effectively distinct storage protocols, each one bound by its schema, and each one requiring an entirely novel examination by a skilled developer before that novel protocol can be accurately processed.

The preferred system proposed in this application, by dispensing with many of the encumbrances of existing systems, may appear at first glance to be a combination of features, commonly or readily achieved in the art. However, such a view would fail to recognise the significance of those features in combination, or that the storage protocol is strictly designed at the outset to achieve something which no other protocol has achieved, namely a capacity (when suitably utilised) for a truly human-independent, binary format that can be read, examined by a standard computer algorithm, and automatically manipulated for the purpose of absorbing its data into a destination data store without any prior examination by a human being, and without a necessary creation of a data definition document or schema, which in itself would require human intervention.

Given such a truly automated process, then it would be possible, limited only by physical constraints such as storage and processing capacity, to absorb all compliant data documents contributed in this format into a single store without a limiting schema; and so provide for the entity owning and supporting the store a single point of contact for all data within the scope of the supporting and client organisation. It is envisaged that such clients may be the population of users of what is currently the web; and the data stored therein may be all data, structured and unstructured, that the world may choose to commit to that store.

In short, and going far beyond any existing protocol, none of which was designed with such a goal in mind, it would be possible to build a datastore or virtual datastore (much as the web is a virtual network, in the sense that there is not one network, but many) with unlimited capacity, global scope, and containing all information extant in the world that the world had chosen to contribute to the store.

The features and characteristics of exemplary embodiments of the disclosed technology will now be described. Also, to aid understanding, we provide a glossary of terms used within the description:

Protocol—a set of specifications for how data may be written to, and read from, a storage device. Any reading or writing application or process will necessarily embody the protocol in software code or in hardware.

Binary Type—the type of data that is represented by the binary encoding within the computer. We may refer to such types by their intuitive names, such as #string, #integer, #float, #html, #image, #audio, #multimedia, etc. However, such references are only for readability, and are not explicitly meant as binary type identifiers required by the protocol.

For clarity, the distinction between conceptual binary types, and binary type identifiers is worth making. A ‘string’ in its conceptual form is a sequence of characters. A skilled programmer appreciates that the characters are binary code, chosen according to a particular convention to denote letters and symbols. ‘String’ as a binary type identifier is a reserved word that requires some form of versioning to identify a designated interpretation or format for that binary type. As a result, user/developer involvement is required as protocols and versions change. In contrast, the preferred embodiment provides means for binary type identification without dependence on keywords, markup or special symbols, thereby eliminating the need for such involvement.

Standard Type—a proprietary definition of a binary data type provided within a software application, operating system, or programming language. Standard data types are usually denoted using reserved keywords or special characters. As noted above, in the preferred embodiment, no proprietary standard types are stipulated. The preferred protocol does of course rely on binary types to be defined by users of the protocol, and proposes a root binary type which can be used in the manner of a standard type by way of common usage rather than requirement. The provision of binary type definitions therefore remains flexible and adaptive. See sections 9 and 13 later.

Gauge—specifies the length in bytes of the data records in the protocol, and how many of those bytes are used for a reference. The same reference size serves both for a data reference (Record ID), as used within the data segment of a record, and for a binary type identifier (Type ID), which as described elsewhere specifies the binary type interpretation appropriate to the data segment in the record. Thus, a protocol having a gauge of 4×20 indicates a record of 20 bytes in length using 4 bytes to refer to the binary type identifier of data.

Self Referential Files—a characteristic of the example system, in particular denoting a file that contains a plurality of records to store both data and binary type identifiers for the data. The file is self referential in that in order to determine the binary type identifier for a particular record of data, the store refers back to records declaring binary type identifiers, and the records declaring binary type identifiers refer to a root record, which in turn refers to itself.

Record—a subdivision in a region of memory that is uniquely addressable and is used for storing user data. Records receive a unique record identifier (Record ID or Reference, abbreviated as ID or Ref). In this system, each record is deemed to contain user data of only a single binary type, and is provided with an explicit binary type identifier so that a computer algorithm may accurately process the data based on recognition or otherwise of that type.

Fixed Record Length—the amount of memory in bytes (or other suitable measure) assigned to each individual record is predetermined by the protocol, and is irrespective of the length of the user data that is to be stored. Thus, more than one record might be required to store a particular instance of data. In the example system, each record has the same length.

Document, File or Map—in the context of this discussion, the name given to the memory space used to store all of the records. Document or File is typically used in the context of hard disk files. Map is typically used where the embodiment is stored within random access memory.

Next, we derive features of embodiments of the disclosed data storage means and protocol from first principles so that the reader may fully appreciate the impact if such apparently simple rules were ignored.

1. The Map Originates at a Fixed Starting Point.

That is to say that the protocol is appropriate for use where a fixed starting point to the map can be externally determined, such as with a file or memory mapped object. We refer to that starting point as byte offset zero, as commonly done in the art.

The alternative is to have a format with special characters to ‘interrupt’ the flow of 1's and 0's, and so indicate key boundaries. Examples are the commas in a CSV (comma separated values) file, and/or the newline and carriage return characters in such a file, or document.

Equally, protocols such as XML and HTML rely on the use of < (less than) and > (greater than) characters to delimit internal structure. Such special characters, where they comprise actual data, must therefore be carefully differentiated by further special characters (&nbsp; for example in HTML is a ‘true’ (non-breaking) space, since a space in an HTML file is essentially ignored as whitespace).

Once special characters are admitted, then special rules need to be invented to deal with situations where those characters are not intended to be special, which commonly requires the proliferation of yet more special characters.
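By way of illustration only, the following minimal sketch uses the standard Python csv module to show how a separator-based format must introduce quoting, and then a rule for quoting the quote character itself. The helper names to_csv_line and from_csv_line are hypothetical and form no part of the disclosed protocol.

```python
import csv
import io


def to_csv_line(fields):
    """Serialise one row using the csv module's default (minimal) quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


def from_csv_line(line):
    """Parse a single CSV line back into its list of fields."""
    return next(csv.reader(io.StringIO(line)))


# A comma inside a field forces the field to be quoted, and a quote
# inside a quoted field must in turn be doubled.
fields = ["plain", "has,comma", 'has "quote"']
line = to_csv_line(fields)
print(line, end="")
```

Each special character (the comma) has required a further special character (the quote), which then needs a rule of its own.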

As it happens, HTML and XML can both be considered ‘document’ protocols which satisfy the fixed starting point requirement, and implement special characters for a different reason (internal structure, and relational data, both of which are handled differently in this protocol).

Nevertheless, the example illustrates the extra burden that special characters place on the user (and, since we intend to eliminate the user as developer, therefore on the interpreting algorithm). The fixed starting point is simply the first such case where a design decision has been made to avoid a particular problem, here special character separators in an open ended stream.

2. The Map Comprises an Integral Count of Records of a Size and Nature Specific to the Embodiment.

The nature and purpose of the preferred system is the arbitrary storage of data of unspecified nature but explicitly declared (we will define this more clearly momentarily).

It would be extremely unusual to consider storing just a single item of client or user data in a data repository (though not by any means impossible, as in a message implementation), therefore it is a necessary design feature to fit the map to the purpose that there should be a demarcation between data entries which is not of the nature of special characters, for reasons outlined above.

The alternative to a special character however is no character (else whichever character is chosen becomes ‘special’, be it a ‘newline’ \n character, a keyword (EOL) or any other embodiment).

That being the case, the boundaries must be assigned without demarcation, and so be implicit in the document, and therefore explicit in the protocol. The demarcation protocol could be of any nature (‘starting point of subsequent record is starting point of prior record + length of prior record’), but such would be unhelpful in the present scope of the disclosed technology, and so a simple fixed record length is selected for the purpose of ease of calculation of binary offset, and for familiarity. (Fixed length record stores are common in the art.) Thus, we require that the records within a document are of a single fixed length.

3. The Records within a Document are of a Single Fixed Length.

The alternative (variable length records) would require either a bizarre algorithm, special characters denoting record end, or conceivably a ‘record length’ count in for example the first 4 bytes of every record including the first.

Thus it would be possible in such a scheme to get to the fourth record by noting the length of the first record, advancing that number of bytes, reading the length bytes of the second record, advancing that number of bytes, and continuing on in this manner until the fourth record is reached.

That iterative procedure is clearly cumbersome and disadvantageous, so is disavowed in favour of the fixed length approach.
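The iterative walk through length-prefixed records may be sketched as follows. This is a hypothetical illustration of the rejected alternative, not of the disclosed protocol; it assumes a 4-byte big-endian signed length prefix that counts payload bytes only, and the function names are invented for the example.

```python
import struct

LENGTH_PREFIX_BYTES = 4  # assumed: big-endian signed length, payload bytes only


def build_variable_store(payloads):
    """Concatenate payloads, each preceded by its 4-byte length."""
    return b"".join(struct.pack(">i", len(p)) + p for p in payloads)


def nth_variable_record(store, n):
    """Reach the 1-based nth record by hopping over every preceding record."""
    offset = 0
    for _ in range(n - 1):
        (length,) = struct.unpack_from(">i", store, offset)
        offset += LENGTH_PREFIX_BYTES + length
    (length,) = struct.unpack_from(">i", store, offset)
    start = offset + LENGTH_PREFIX_BYTES
    return store[start:start + length]


store = build_variable_store([b"a", b"bcd", b"", b"efgh"])
fourth = nth_variable_record(store, 4)  # requires reading three earlier prefixes
```

Reaching record n costs n−1 reads of earlier length prefixes, whereas a fixed record length allows the offset to be computed directly.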

4. The Length is Fixed Across the Entire Protocol.

While it may be commonplace to find embodiments of fixed-length records, it is somewhat less so to find such that insist upon a single common length across the entire protocol. That is to say that for a single embodiment of the protocol itself, every file shares the same record structure. Thus it is sufficient to know (or be informed) that a file is of a structure conforming to the preferred protocol to read it successfully (in the manner described below).

As will be demonstrated, there are arguments for various possible record structures, each of which offers in particular different capacities, but at the current time, where computers readily work with 32 bit integers, and hard drives are commonly of 20, 40, 80, or up to 100+ gigabytes, a record format (described below) is provided.

5. Records are Referred to by Integral Id, Monotonic Increasing, and One-Based.

With a fixed starting point, and fixed length records, it is simple to provide each record with an implicit record index or identifier, as a 1-based, monotonic increasing integer.

The binary offset at which the nth record is to be found is readily calculated then as (n−1)×(record length), with the first record (id=1) starting at binary offset zero.
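A minimal sketch of this calculation follows, assuming the 20-byte record length of the 4×20 gauge used as an example elsewhere in this description. The names record_offset and read_record_bytes are illustrative only.

```python
RECORD_LENGTH = 20  # bytes per record, assumed from the example 4x20 gauge


def record_offset(record_id, record_length=RECORD_LENGTH):
    """Byte offset of a record: (n - 1) x (record length), for 1-based n."""
    if record_id < 1:
        raise ValueError("record identifiers are 1-based")
    return (record_id - 1) * record_length


def read_record_bytes(store, record_id, record_length=RECORD_LENGTH):
    """Slice one whole record out of an in-memory map."""
    start = record_offset(record_id, record_length)
    if start + record_length > len(store):
        raise IndexError("record lies beyond the end of the map")
    return store[start:start + record_length]


example_map = bytes(range(60))  # three records of 20 bytes each
second = read_record_bytes(example_map, 2)
```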

This is common in the art. What is less common is that record length can be constrained across an entire protocol, regardless of intended use, as noted in 4.

There is nevertheless a choice which we should clarify, commonly between 0-based and 1-based indices. Although it is common (as we do) to refer to the first byte as being at offset zero, or likewise for the first item in an array, it is also true that zero is the default (uninitialised value) for many coding languages, and it would be very easy to commonly and unintentionally refer to record-zero when in fact the variable had simply not been initialised.

Therefore, we consider it a design criterion that the record identifier be 1-based. Likewise, it is then safe to return zero in functions that might be expected to get a record id as their result, when they fail.
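The use of zero as a failure result can be sketched as below. The search function find_record_id is hypothetical and is shown only to illustrate the convention; it assumes 20-byte records held in memory.

```python
def find_record_id(store, wanted, record_length=20):
    """Return the 1-based ID of the first record equal to `wanted`, else 0."""
    record_count = len(store) // record_length
    for record_id in range(1, record_count + 1):
        start = (record_id - 1) * record_length
        if store[start:start + record_length] == wanted:
            return record_id
    return 0  # zero is never a valid record identifier, so it signals failure


two_records = b"A" * 20 + b"B" * 20
found = find_record_id(two_records, b"B" * 20)
missing = find_record_id(two_records, b"C" * 20)
```

Because no valid record has identifier zero, a caller may test the result directly for success or failure.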

6. Record Identifiers are Positive (Greater than Zero).

This may seem trivial, but in conjunction with the gauge, sets the upper limit for a valid record id, as will be seen in a moment.

For a gauge using 4-byte references for record identifiers, there exists a choice between allowing an upper limit based on the common ‘int’ (signed 4 byte integer) binary type, or using the upper limit of the unsigned integer type. While the latter would provide a greater upper limit (approximately 4 billion compared with 2 billion), it would introduce ambiguity where the coder compiled reader/writer applications using the more restricted signed int32 type, so that record identifiers beyond 2 billion (int.MaxValue) would require special handling.

For this reason, we prefer to limit the protocol to the safer, lower limit of the signed integer representation of a particular gauge, thus Int32 rather than Unsigned Int32, for a 4-byte gauge.

7. Record Identifiers as a Maximum are 1 Less than the Maximum Positive Number

In fact we restrict the maximum record ID to one less than the maximum positive representation. This avoids a further ambiguity, as a common coding loop may look like for (int i=1; i<=intmax; i++){ }.

It is a subtle error, but i cannot reach (intmax+1), where it would normally terminate, since by definition intmax is the largest integer that can be held. In languages where signed integers wrap on overflow, the counter i will then cycle back to intmin, and the loop will never terminate.

It is safer therefore to highlight this by regarding (intmax−1) as being the largest valid record ID, where intmax is the largest positive integer that can be represented, using the reference size (to be discussed) that defines in part the embodiment.
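The limit and the wrap-around hazard can be sketched as follows. Python integers do not overflow, so the sketch uses ctypes.c_int32, which performs no overflow checking, to simulate a wrapping signed 32-bit counter. The function names are illustrative assumptions.

```python
import ctypes


def max_record_id(reference_bytes):
    """Largest valid record ID: one less than the largest signed integer."""
    int_max = 2 ** (8 * reference_bytes - 1) - 1
    return int_max - 1


def is_valid_record_id(record_id, reference_bytes=4):
    """Record IDs are positive and never exceed intmax - 1."""
    return 1 <= record_id <= max_record_id(reference_bytes)


def increment_int32(value):
    """Simulate i++ on a wrapping signed 32-bit counter."""
    return ctypes.c_int32(value + 1).value


INT32_MAX = 2 ** 31 - 1
wrapped = increment_int32(INT32_MAX)  # becomes negative, so i <= intmax stays true
```

With the largest record ID held at intmax−1, a loop bounded by the largest valid ID terminates when the counter reaches intmax, without overflow.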

It may not be apparent why there should be a limiting size based on an arbitrary reference size (see later), when it would surely be possible to simply store the record ID in an int64 (8 byte integer) for example. The need for such will be shown shortly.

8. Records are of Arbitrary Binary Type.

Since we intend to provide a general storage medium for any binary data, of any type, in use now, legacy, or unknown as yet and to be invented in the future, we need therefore to allow records to store data of any binary type. The mechanism for this is illustrated in the sections below.

9. There are No ‘Standard’ Types Intrinsic to the Embodiment.

Most protocols opt for short term convenience of the (human) user over that of a generalised interpreting algorithm. Thus they tend to be advertised with a limited set of initial types, described and declared typically using text keywords, which are then expanded over time as users find more types convenient. See discussion of binary types and standard types above.

The standard types of course, like special characters, then require special characters, or keywords of their own. These must be advertised, published in books, and learned by users, who when developing interpreters must look for these special keywords.

Further, any interpreting algorithm developed for an early release of a protocol must subsequently be revised or rejected, if a later version of the protocol is released to accommodate a widened variety of types (or modified structure). Since it is our aim to release a ‘one-off’ or ‘eternal’ protocol, it is nevertheless apparent that simple rules make for durable protocols.

XML is by contrast a more complex protocol (largely due to its intent to create internal structure), but its roots are equally sound and simple (a hierarchy of less-than/greater-than braced element pairs <element></element>), which in large part accounts for its popularity.

Nevertheless its reliance on ‘<,>’ special characters, keywords (e.g. CDATA), and arbitrary types, currently popular, make it vulnerable to modification, should popular demand suggest a new implementation, at which point current interpreters will become inadequate. (XML is inadequate for our purposes for many other reasons, but this is certainly one of them.)

We eschew ‘standard’ types identified by keywords, and seek a binary unambiguous declaration of binary type. The means by which standard types are eliminated in the preferred embodiment is by the self-referential binary type declaration, as discussed below.

10. Binary Type is Identified by Unambiguous Binary Identifier.

An accurate interpretation of the otherwise meaningless binary 1's and 0's depends on identifying a binary type. Tautologically, the binary type is an invention by human beings (or convention) as to how to interpret data, whence algorithms for the appropriate creation of bytes for storage, and interpretation of bytes on retrieval can be coded.

Interpretation further requires the accurate association between a specified set of bytes, and a binary type identifier, which itself designates a binary type recognised by convention.

The correct interpretation of bytes therefore requires three elements:

1) a (human) convention as to a hypothetical binary type, e.g. ‘big-endian 4-byte signed integer’;

2) an identifier for such within the storage protocol or coding language (e.g. int, Int32, integer, long, all of which are variously used to designate the same thing in the art, according to context); and

3) the assignment of the identifier to the specific bytes in question.
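The three elements above can be illustrated with a short sketch using the standard Python struct module. The text labels used as dictionary keys are hypothetical stand-ins chosen for readability; they are precisely the kind of keyword identifier that the disclosed protocol avoids, and are used here only to show that the same bytes yield different values under different conventions.

```python
import struct

# Element 1: conventions, expressed as decoding routines.
# Element 2: identifiers for those conventions (hypothetical text labels).
INTERPRETERS = {
    "int32-big-endian": lambda data: struct.unpack(">i", data)[0],
    "int32-little-endian": lambda data: struct.unpack("<i", data)[0],
}


def interpret(identifier, data):
    """Element 3: assign an identifier to specific bytes and decode them."""
    return INTERPRETERS[identifier](data)


raw = b"\x00\x00\x00\x2a"
as_big = interpret("int32-big-endian", raw)
as_little = interpret("int32-little-endian", raw)
print(as_big, as_little)
```

Without the assignment of an identifier, the four bytes alone do not determine which of the two values was intended.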

We have considered the impact of these necessary steps, and their associated embodiment in current protocols, and have adopted an implementation in the current protocol that provides stability and longevity in the sense of essentially no versioning, and automated interpretation of data.

As regards the first step of the above three, we have eliminated the limiting feature of human-invented pre-designed types being considered as part of the protocol (no standard types as noted above).

As regards the second step, we have further eliminated the limiting feature of human-invented keywords to describe such binary types, which otherwise would require versioning as future types needed to be accommodated.

As regards the third step, we have further insisted that the binary type assignment to data be performed locally, within the file, so that no external resource is required to accurately determine the identity of the binary type by which the data is stored.

Thus, each distinct data item or record in the system may be rapidly assigned a binary type identifier, based upon which further more advanced processing may follow.

11. A Self Referential System Mandates at Least One Root Identifier

The presence of binary type identifiers within the file, without their being hard-coded into the protocol, suggests that they themselves might in some fashion be considered ‘data’, and as such have a binary type identifier of their own.

Thus, in embodiments of the disclosed technology, binary data (the content of the file) has associated binary type identifiers, which by the argument above are themselves data, with their own binary type identifiers, which if they are not to resolve into a circular argument must terminate in at least one ‘root’ binary type identifier.

The choice of the binary type identifier for such ‘root’ elements, and the choice of binary type to be represented by that identifier, is a further element in embodiments of the disclosed technology as discussed below. This choice of binary type and binary type identifier, along with gauge, determines the particular embodiment of a generalised self-referential format.

12. Preferred and Alternative Root Binary-Type Identifiers.

Globally Unique Identifiers (GUIDs), also known as Universally Unique Identifiers (UUIDs), which are well known in the art, provide means for identification that can, in practice, be considered unique. In the preferred embodiment, GUIDs (UUIDs) form the basis of binary type declaration.

An example embodiment of the self-referential data system is therefore one whereby the root binary type is of binary type GUID (aka UUID), and the gauge is 4×20, being 20 byte records, with 4-byte (signed integer) reference, as described earlier. The requisite identifier for the GUID/UUID binary type may be {B79F76DD-C835-4568-9FA9-B13A6C596B93} for example. The means by which these declarations are made in practice will be further set out later in the document.
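A speculative sketch of how such an embodiment might be laid out in memory follows. Several details are assumptions made for illustration and are not stated in the passage above: that the 4-byte type reference occupies the first 4 bytes of each record in big-endian order, that the remaining 16 bytes hold the data (a GUID occupying exactly 16 bytes), that short data is padded with zero bytes, and that the root declaration is stored as record 1. The second GUID and all function names are hypothetical.

```python
import struct
import uuid

REFERENCE_BYTES = 4   # gauge 4x20: 4-byte signed reference
RECORD_LENGTH = 20    # gauge 4x20: 20-byte records
DATA_BYTES = RECORD_LENGTH - REFERENCE_BYTES

ROOT_GUID = uuid.UUID("{B79F76DD-C835-4568-9FA9-B13A6C596B93}")
# Hypothetical identifier for some user-declared type; invented for this sketch.
EXAMPLE_TYPE_GUID = uuid.UUID("00000000-0000-4000-8000-000000000001")


def pack_record(type_id, data):
    """Assumed layout: type reference first, then zero-padded data."""
    if len(data) > DATA_BYTES:
        raise ValueError("data does not fit in a single record")
    return struct.pack(">i", type_id) + data.ljust(DATA_BYTES, b"\x00")


def read_record(store, record_id):
    """Return (type reference, data segment) for a 1-based record ID."""
    start = (record_id - 1) * RECORD_LENGTH
    raw = store[start:start + RECORD_LENGTH]
    (type_id,) = struct.unpack_from(">i", raw)
    return type_id, raw[REFERENCE_BYTES:]


def type_chain(store, record_id):
    """Follow type references until a record that refers to itself."""
    chain = [record_id]
    while True:
        type_id, _ = read_record(store, chain[-1])
        if type_id == chain[-1]:
            return chain  # reached the self-referential root
        if type_id in chain:
            raise ValueError("type references form a cycle without a root")
        chain.append(type_id)


store = (
    pack_record(1, ROOT_GUID.bytes)            # record 1: root, refers to itself
    + pack_record(1, EXAMPLE_TYPE_GUID.bytes)  # record 2: a type declared as a GUID
    + pack_record(2, b"hello")                 # record 3: data of the type in record 2
)
chain = type_chain(store, 3)
```

Under these assumptions, resolving the type of record 3 passes through the declaring record 2 and terminates at the root record 1, whose type reference points to itself.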

In alternative embodiments, however, other types of identifier could beused to suit requirements. It is possible for example to remainconsistent with the self-referential underlying file protocol of thedisclosed technology, while maintaining multiple root declarations.These may indicate entirely different binary-type identificationprotocols, such as a root binary type and subsequent binary typesequally declared by a root string and subsequent strings instead ofUUIDS, in addition or instead of a root declaration indicating aUUID-based declaration referential hierarchy.

However, in the same way that a markup file might contain both an XMLdocument or segment and an HTML document, but that in practice it iscommon and preferred to keep these separate and to have single-usedocuments, it is a preferred feature of the embodiment that binarystores using the protocol restrict themselves to a single common root bywhich subsequent binary types may be identified.

As explained above, although UUIDs are preferred, the embodiment makesno restriction on what root identifiers are used. The generality andsimplicity of the protocol is such that even if a further rootidentifier became popular, perhaps by means of pursuit of dominance ofthe standard by a third party, then by simple recognition of itsexistence, all such files using that root would become once moretransparent and automatically open to process. While a party can isolatethemselves if they wish by adhering to an arcane and unusual choice ofidentifiers, they cannot dominate the standard, any more than any singleentity can dominate a particular spoken language.

13. Standard Types are ‘Common by Usage’ not by Declaration.

To revisit briefly the earlier comment on standard types, a standard type may not exist by ‘keyword’ declaration, nor is it desirable to insist upon a formal recognition of a standard type, at the expense of being inflexible as regards future requirements.

That does not preclude however ‘advertising’ preferred identifiers for common types, and it is anticipated that as with IBM and the PC, and Microsoft and most everything else, when and if Microsoft and/or the Linux community choose ‘preferred’ identifiers, they will likely become common standards.

Thus, it is envisaged that users of the protocol can and will inform interested parties as to their preferred identities. However, such identities are options and choices only. They are not an integral part of the protocol, nor should they ever be assumed to be so.

14. Each Record has an Explicit Binary Type.

‘Blobs’, meaningless bytes (meaningless as in ‘of undeclared type’) are of no interest to us, nor we hope to the data community at large. There is very little value in being sent a series of effectively random 1's and 0's, and while hackers may rejoice in dissecting blocks of binary data to discern patterns, and content, we do not, and nor do we recommend or desire it to be supported by our protocol. A record without an explicit binary type is therefore in our view meaningless, as data, and we therefore require that every record intended for interpretation as data have an explicit binary type.

It is also emphasised that such binary type declaration (the integer TypeID) must be declared by self-referential declaration (a binary type identifier in the same file) and not by ‘common usage of a known integer’ (eg: 3=Int32, 4=string). See the discussion of standard types in section 13 for the reasons.

15. Records without Such a Type are Ignored as Data.

We do not however require that an interpreting protocol fail for want of an explicit type. It would then be easy for a careless or malicious user to intentionally corrupt such packaged data for precisely this purpose.

We do however intend that data which is untyped should not be treated as legitimate for the purposes of normal engine functions, data exchange, or data absorption.

16. Private Usage of Untyped Data is Overlooked.

As long as no inference is made about such data for the purposes of data exchange, data description, or data storage, then private usage of untyped data is overlooked. ‘Meaningless’ (for public data purposes) does not quite mean ‘useless’.

One such use can be, for example, to list a series of ‘flags’ at the beginning of a file, which while not formally data, can be an indicator to the engine, as to source, style or other information.

What they are not is formal data, and any attempt to read them should fail, or return a warning. (We distinguish between tolerant failure, that is, recognising data as untyped and politely refusing to read or supply it, and intolerant failure, where the application aborts. We do not consider it appropriate that the application should abort).

Further, any such usage must still comply with the fundamental file structure being set out herein. There will be no tolerance for corrupted structure files, ‘special’ headers or the like. The protocol is strict, and simple, and for good reason.

Untyped content is tolerated, but is not considered ‘true’ or good data. Corrupted structure is never tolerated.
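The distinction between tolerant failure and corrupted structure might be sketched as follows. This is an illustrative sketch only, not the disclosed implementation: the names `check_structure` and `read_data`, the use of a zero reference to mark an untyped record, 1-based record ids and little-endian byte order are all assumptions made for the example.

```python
import struct
import warnings

REF_SIZE, REC_LEN = 4, 20   # the 4x20 gauge of the example system
UNTYPED = 0                 # assumption: a zero reference marks an untyped record

def check_structure(buf: bytes) -> None:
    """Corrupted structure is never tolerated: the store must be whole records."""
    if len(buf) % REC_LEN != 0:
        raise ValueError("corrupted structure: not a whole number of records")

def read_data(buf: bytes, record_id: int):
    """Return (type_ref, data) for a typed record, or None with a warning if untyped."""
    check_structure(buf)
    start = (record_id - 1) * REC_LEN          # assumption: record ids start at 1
    if record_id < 1 or start >= len(buf):
        raise IndexError(f"no record {record_id} in this store")
    (type_ref,) = struct.unpack_from("<i", buf, start)
    if type_ref <= UNTYPED:
        # Tolerant failure: politely refuse to supply the bytes as data.
        warnings.warn(f"record {record_id} is untyped and is not supplied as data")
        return None
    return type_ref, bytes(buf[start + REF_SIZE:start + REC_LEN])
```

A reader built this way never aborts on an untyped record such as a private ‘flags’ slot, but it refuses any store whose length is not a whole number of records.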

17. Each Record has an Intrinsically Declared Binary Type.

That each record should have an explicit data type does not in itself determine how that type should be specified (in terms of internal record structure). It would be possible to implement many styles of binary type representation.

Firstly, the type need not be integral to the record. It may be stored as a separate descriptor, as with fields in relational databases. There, data types are commonly stored by field, not by individual record. It would be incredibly wasteful in a protocol with fixed field/binary type association to repeatedly store the type in every field-value.

Our ‘records’ however are not intrinsically structured data in the sense of an RDBMS. Rather they are more akin to individual slots, holding arbitrary data, which may or may not have an internal structural representation. They inevitably will, since only truly random bytes have no intent to be ‘interpreted’, and that interpretation will require understanding and structure, even for something as simple as an integer.

Since they are arbitrarily assigned slots of arbitrary type, we therefore require that each record or slot should have its own intrinsic binary type declaration.

18. Binary-Type Byte Allocation.

If ‘standard’ types were allowed, a possible means of binary type declaration might be then that a single byte would suffice, with up to 255 different types (and 0 for untyped), as a binary type declaration. Further, such types could be ‘hard-coded’, such that 1=int, 2=double etc., as is commonly found in other (binary) protocols. C++ enumerations for example comprise precisely this style of hard-coded integers.

However, we have already indicated that binary type should preferably be indicated by GUIDs, which are themselves 16 bytes long (as binary data; their string representations are longer, and variable, but we refer only and explicitly here to their binary representation).

However, we do not wish to store a full 16 bytes as binary type declaration, in each and every record. This would be foolish, given the preponderance of data generally to fall within a limited set of commonly used types, at least for a particular user and application; it would be as wasteful as storing the binary type in each and every value entry in a database. Thus, we have appreciated that it is advantageous to use or allow some form of referential identity to specify or declare data types.

19. Self-Referential Binary Type

The self-referential binary type is an element in embodiments of the disclosed storage protocol that helps ensure that files are self-contained, binary unambiguous and stable for the purposes of reader/writer algorithms. In the example system, it is by design that only records are stored in the data store. There are no sub-divisions or partitions proprietary in nature or otherwise difficult to determine. To appreciate the structure of an entire store in this protocol it is sufficient to understand this simple but strict adherence to a gauge-based fixed-length record structure. This is by design.

A record declaring an original root binary type is a record containing a GUID; the GUID acts as an identifier for that binary type. As the record contains a GUID, the record itself must be of type GUID, and must therefore include a binary type reference to the record that declares the binary type ‘GUID’.

By inference, therefore, the record that the binary type reference points to must also be of type GUID, and must contain the GUID identifying the type GUID. In turn that record must point to itself, to identify its own binary type.

Thus, the storage protocol is self referential with respect to binary type in two senses: every record has a binary type declared by GUID, which is declared in the same file; and the root of the GUID hierarchy, of type GUID, points to itself.
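By way of illustration, the two senses of self-reference might be sketched as a three-record store in the 4×20 gauge. The sketch rests on assumptions that are not taken from the description: record ids start at 1, references are little-endian signed integers, GUIDs are stored in the byte order produced by Python's `uuid` module, and `G_STRING` is a hypothetical identifier invented for the example. Only the root GUID is the example value quoted in section 12.

```python
import struct
import uuid

REF_SIZE, REC_LEN = 4, 20          # the 4x20 gauge
DATA_LEN = REC_LEN - REF_SIZE      # 16 data bytes, exactly one binary GUID

# Example root identifier quoted in section 12.
G_UUID = uuid.UUID("B79F76DD-C835-4568-9FA9-B13A6C596B93")
# Hypothetical identifier for a user-defined 'string' type (not from the source).
G_STRING = uuid.UUID("11111111-2222-4333-8444-555555555555")

def pack_record(type_ref: int, data: bytes) -> bytes:
    """Pack one record: 4-byte signed reference, then zero-padded data."""
    if len(data) > DATA_LEN:
        raise ValueError("data does not fit in a single record")
    return struct.pack("<i", type_ref) + data.ljust(DATA_LEN, b"\x00")

def read_record(buf: bytes, record_id: int) -> tuple[int, bytes]:
    """Return (type_ref, data) for a 1-based record id."""
    start = (record_id - 1) * REC_LEN
    (type_ref,) = struct.unpack_from("<i", buf, start)
    return type_ref, bytes(buf[start + REF_SIZE:start + REC_LEN])

store = bytearray()
# Record 1: the root. It holds the GUID for 'GUID' and points to itself.
store += pack_record(1, G_UUID.bytes)
# Record 2: declares a 'string' type. Its content is a GUID, so its type is record 1.
store += pack_record(1, G_STRING.bytes)
# Record 3: user data of type 'string', typed by reference to record 2.
store += pack_record(2, b"Andrew")
```

Following the type reference of record 3 leads to record 2, whose own reference leads to record 1, which refers to itself and so terminates the chain.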

Since the binary-type GUID is stored within the data store, and since it is intrinsically a globally-safe identifier, it immediately releases us from externally defined or derived URLs, schemas, or other forms of validation.

That is not to say that a human understands what to do with an arbitrary GUID, as they are essentially 16 byte random numbers. (Skilled developers will appreciate that they can be more than that, but it is sufficient for this explanation to consider them as such). Rather it is to say that a computer recognises a GUID as a common programming type, which can be used as an identifier and indicator as to further programming requirements.

Reference shall now be made to FIG. 1, which logically illustrates the data structure outlined above. The figure shows a table 2 representing the usage of memory space in a computer system. It will be appreciated that the memory space could be provided as dedicated computer memory, or on a portable memory device such as a disc or solid state device. If provided as dedicated memory within a computer, the table is effectively a memory map. Otherwise, the table typically corresponds to a file.

The top left corner 4 of the table represents the first byte, byte zero in the memory map or file. The table then comprises two columns, and a plurality of rows. Each row is a data record.

A first column 6, called the Binary Type column, is used to store a reference to a record, in order to indicate the binary type of any subsequent data in that row. The second column 8 is used to store data, and is called the Data column.

Counting from byte zero in memory, a subsequent predetermined number of bytes n1 of the file or memory space are reserved for storing the first entry or instance in the binary type column. The next contiguous section of bytes, number n2, is then reserved for the first entry or instance in the data column (the widths of the columns in bytes will be explained in more detail below).

Together, the bytes reserved for the first instance in the binary type column, and the bytes reserved for the first instance in the data column constitute the first record. The record number is indicated schematically to the left of the table in a separate column 10. It will be appreciated that column 10 is shown purely for convenience, and preferably does not form part of the memory map or table itself.

In repeating fashion, the next record is comprised of the next n1 bytes of memory or file space for the binary type entry, following on without break from the last byte of the previous record, and the next n2 bytes for data.

Although the table shown in FIG. 1 is useful for purposes of illustration, it will be appreciated that there is nothing stored in memory itself that defines a table, or even a table like structure. The bytes in memory are reserved either to store a binary type indicator, or to store data. The memory usage is therefore likely to look more like that of FIG. 2, with the shaded boxes representing space reserved for binary types, and the blank boxes reserved for data. The apparently random structure of the diagram however is simply to confirm the lack of markup or designators. In practice, since the record lengths are fixed, it is easier to think in terms of the regular, structured table illustrated in FIG. 1. Note that there is no table of contents included in the memory space or file. Instead, records are accessed by moving through the memory or file in increments of (n1+n2) bytes. For this reason, n1 and n2 are fixed throughout the memory or file as discussed above, and the records begin at byte zero.
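The access rule described above, stepping through the store in increments of (n1+n2) bytes from byte zero, might be sketched as follows. The function names are hypothetical, and the sketch assumes that record ids are counted from 1; the description itself fixes only that records begin at byte zero and are of fixed length.

```python
def record_offset(record_id: int, n1: int, n2: int) -> int:
    """Byte offset of a record, assuming record ids are counted from 1."""
    return (record_id - 1) * (n1 + n2)

def record_id_at(offset: int, n1: int, n2: int) -> int:
    """Inverse mapping: the record id whose first byte sits at this offset."""
    quotient, remainder = divmod(offset, n1 + n2)
    if remainder:
        raise ValueError("offset is not on a record border")
    return quotient + 1

def iter_records(buf: bytes, n1: int, n2: int):
    """Yield (record_id, binary_type_bytes, data_bytes) for every record in turn."""
    rec_len = n1 + n2
    if len(buf) % rec_len:
        raise ValueError("store is not a whole number of records")
    for start in range(0, len(buf), rec_len):
        yield (start // rec_len + 1,
               buf[start:start + n1],
               buf[start + n1:start + rec_len])
```

No table of contents is consulted at any point: the record id and the byte offset are each computed from the other.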

20. Binary Type Plus Data is Sufficient for Each Record

It may seem obvious that if we've finally declared a type, then the rest should be data; but in fact there are (at least) two reasonable candidates for inclusion into the record structure.

a) Record ID
b) Data Length

21. Record ID is not Required in the Record Structure

The use of a Record ID would offer ‘confirmation’ that we had the right record, if we included the record id in each record. Further, it would offer security in ‘open-ended’ streams, where bytes may be lost, that each new record was indeed as advertised, and of the appropriate identity.

In practice however, the fixed-starting point, fixed-record length protocol is entirely robust without such a mechanism, so it is eschewed. The security check in the open ended stream is better dealt with separately, by the selected protocol/embodiment responsible for passing/receiving the stream itself. As noted earlier, in a fixed starting point, fixed length file, the record ID can be inferred from the binary offset and vice versa, reliably and effectively. There is therefore no need in the preferred embodiment for a record id within each record/slot. However, should a user require an embodiment with explicit record identifiers to be stored as part of the record, this would be possible.

22. Data Length is not Required in the Record Structure

Omitting a data length from the record structure does not preclude a given binary type including its own length data. BSTR's (Binary Strings) for example have a length prefix, whereas C-Strings (known in the art) do not, but are null-terminated (have character zero where the string terminates). The protocol need only ensure that sufficient bytes are stored to cover all the bytes that were passed by the contributor.

Since the records are of fixed length, if there are fewer bytes passed in than are required to complete a record, the remaining bytes are required to be set to zero.

If the data contributor requires a notation of the exact number of bytes passed in (rather than the storage capacity allocated), they may declare a binary type with length integral to (i.e.: held internally within the databytes of) that type. The protocol is therefore effective without the requirement for an explicit length specifier for each data item or class of items.

23. Data Length is Ambiguous

In fact, the concept of data length, which seems so obvious, is intrinsically ambiguous. How long is the data ‘Andrew’? It is tempting to say 6 bytes. However, with a terminating zero it would be 7. Indeed, if it were encoded as Unicode, it would be 12 bytes.

And if it were passed in a 100 byte buffer, the protocol would receive 100 bytes, and it is only an opinion that only 6, 7 or 12 of those respectively are significant.
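The four candidate lengths can be checked directly. The sketch below assumes that ‘encoded as Unicode’ means two bytes per character without a byte-order mark (UTF-16), which is the reading consistent with the figure of 12 bytes.

```python
name = "Andrew"

as_ascii = name.encode("ascii")            # one byte per character
as_c_string = as_ascii + b"\x00"           # with a terminating zero
as_utf16 = name.encode("utf-16-le")        # assumption: 'Unicode' means UTF-16, no BOM
in_buffer = as_ascii.ljust(100, b"\x00")   # the same text passed in a 100 byte buffer

lengths = {
    "ascii": len(as_ascii),
    "c_string": len(as_c_string),
    "utf16": len(as_utf16),
    "buffer": len(in_buffer),
}
print(lengths)  # {'ascii': 6, 'c_string': 7, 'utf16': 12, 'buffer': 100}
```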

Thus, data length is inseparable from human opinion. Therefore, not only do we not regard data length as necessary, we regard it as outright ambiguous and unhelpful.

24. Data is Stored at Least to the Last Significant Byte.

In the light of the above, especially where buffers are concerned, a 10k buffer (10,000 bytes) holding the string ‘Andrew’ will rapidly eat up storage capacity if we attempt to store every trailing zero.

On the one hand, the client engine may intend a dynamic record, and will use the empty space later. On the other, they may simply have been using a convenient buffer, but it is our store that will fill up rapidly and unnecessarily as a result.

We will not attempt to ‘interpret’ the data as a null-terminated string (i.e. look for a first zero and terminate). That is not our job, and it is an insidious route to believe that we can reasonably understand and interpret a number of types in order to be ‘helpful’. To be helpful is to risk making inappropriate assumptions. Better to be strict and simple, and let the contributing/reading engines be ‘helpful’, as they see fit.

It is preferred however to avoid storing myriad zeros ‘unnecessarily’. This does not restrict the user, as shall be explained. The protocol promises therefore to store at least to the ‘last significant byte’ (last non-zero byte), and it may indeed store all the trailing zeros. However it is considered to be a matter of discretion for the embodiment whether it does so or not, nor need it maintain any record of the incoming buffer size. If the user needs that size specifically they can themselves define a binary type that includes that information and submit that as data.
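One way an embodiment might honour the ‘last significant byte’ promise is sketched below. The function names are hypothetical, and trimming is only one permitted choice, since an embodiment may equally store every trailing zero.

```python
def significant(data: bytes) -> bytes:
    """Trim trailing zeros only; zeros before the last non-zero byte are kept."""
    return data.rstrip(b"\x00")

def to_record_data(data: bytes, data_len: int = 16) -> bytes:
    """Zero-pad trimmed data to the fixed data width of a single record."""
    trimmed = significant(data)
    if len(trimmed) > data_len:
        raise ValueError("data needs more than one record")
    return trimmed.ljust(data_len, b"\x00")

buffer_10k = b"Andrew".ljust(10_000, b"\x00")   # a 10,000 byte buffer
stored = to_record_data(buffer_10k)             # 16 bytes rather than 10,000
```

Note that `significant(b"A\x00B\x00\x00")` keeps the embedded zero and returns `b"A\x00B"`: the data is not interpreted as a null-terminated string.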

25. Records May be ‘Reserved’ to Cover a Fixed Size.

Where a block of data is required for later filling with data, but the data is not yet ready, or the engine simply wants to see if there is enough room available, then it may ‘reserve’ a block of records by insisting on a fixed size, specified either in bytes or records (we recommend bytes, which is more intuitive, and also errs on the side of caution, if the user inadvertently specifies records). It can do so by simply adding a block of records of sufficient capacity.

This anticipates the handling of data which exceeds the record data length; first, we need to finalise and clarify the individual record structure.

26. Gauge

The gauge defines the internal structure of records and files. Like a railway gauge, neither the reference size nor data length (remaining data bytes per record) need to have particular dimensions; except that once specified, they become a single, final and permanent feature of the example system, and all files with identical structure (and obeying the rules for self-referential binary type) are therefore by definition instances of the same identical protocol.

In the example system outlined earlier, files are of integral record count, records are 20 bytes in length, with 4 of those bytes being used to store an integer reference to another record in the file declaring the binary type.

Once a gauge is specified, the capacity of the file can be determined. Recalling that we allow only +ve (positive) integers within the meaning of the refsize, a 4-byte integer, which we treat as signed to be safe, restricts the file protocol to approx 2 billion records. (Strictly: max(Int32)−1)

For a 4×20 gauge, then, we therefore have a file size of approx 2 billion×20 bytes, or 40 gigabytes maximum file size. (The figure is precisely determinable since the maximum possible value of a 32-bit signed integer is precisely determinable. We use the approximations here solely for readability). The 16 bytes of the record not used for holding the 4 byte reference are used for storing user data.

Thus, for 16 bytes data per record, 2 billion×16 bytes of data can be stored, or approximately 32 gigabytes maximum data storage, of which some at least will be used (if the file is to be consistent with the protocol) to declare the binary types of the data in the file.
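The precise figures behind these approximations can be computed for any gauge. The sketch below uses a hypothetical helper, `capacity`, and takes the record limit as the maximum signed value less one, following the ‘Strictly: max(Int32)−1’ remark above.

```python
def capacity(ref_size: int, rec_len: int) -> tuple[int, int, int]:
    """Return (max_records, max_file_bytes, max_data_bytes) for a gauge.

    ref_size and rec_len are in bytes, so the 4x20 gauge is capacity(4, 20).
    """
    max_signed = 2 ** (8 * ref_size - 1) - 1    # largest signed reference value
    max_records = max_signed - 1                # as stated: max(Int32) - 1
    data_len = rec_len - ref_size
    return max_records, max_records * rec_len, max_records * data_len

records, file_bytes, data_bytes = capacity(4, 20)
print(records)      # 2147483646
print(file_bytes)   # 42949672920
print(data_bytes)   # 34359738336
```

The rounded figures of 40 and 32 gigabytes correspond to these byte counts when a gigabyte is read as 2^30 bytes.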

(Note that the binary types do not have to be all declared up front. They only need to be in the file at the same time as, or preferably before (with earlier id) the record whose type they describe).

The 4×20 gauge is particularly useful because it results in a practical file size capacity, and a common refsize (abbreviation for reference size, by which we store the binary type identifier) (int32), and because the 16 data bytes within the 4×20 gauge conveniently allow us to store a single GUID in exactly the data comprising a single record (a.k.a. a singleton record, or singleton).

Other gauges could be used, provided any file or map indicated as being of a particular gauge is internally consistent when interpreted, being ruled, with record borders every reclength (abbreviation for record length) bytes in that fashion.

If we chose a larger gauge, maintaining the refsize, but enlarging the data to say 36 bytes, for a 40 byte total record, then the capacity of a single file would go up to 2 billion (4 byte refsize signed int max, −1)×36 bytes (data)=72 gigabyte capacity. However, with GUIDs being extremely common in the protocol, then any GUID record would use only 16 of 36 bytes, leaving 20 bytes per record as simple empty zeros.

Against that, if the ‘natural’ data to be stored was of length 36 bytes, or simply ‘large’, then the larger the databytes, the more efficient the storage for that type. The final trade off will be against common usage, efficiency, saving on common types vs wastage on eg: GUIDs, and the absolute single file capacity required.

27. Extension Records

With a fixed-length record, we are clearly limited in the amount of data we can store in a single record. This is true of any data storage system, and even where input is of variable length, it is common practice to put an upper limit on the length of possible values. (eg: varchar[255] to indicate a variable length string up to 255 characters max). We consider this an unnecessary and limiting restriction.

The example system supports incoming data of arbitrary length, subject to the remaining capacity of the device and/or protocol, by means of extension records.

Since by design no magic numbers or special characters are allowed, these extension records must follow the same protocol as for any other binary type. Nevertheless, this is readily and easily done.

A binary type is declared as {gExtension} (or {gExtn}), where the {g[something]} notation indicates a binary GUID, but labelled conveniently for explanation and readability in this document.

Thus, {gUUID} or {gGUIDTypeUUID} may be used to indicate the binary GUID used to declare items of type GUID, in other words the root of the binary type declaration tree. Subsequent types (e.g.: {gString}) will be of Binary Type {gUUID}, but will have their own GUID for declaration of such data, e.g. strings with associated binary type guid {gString}, an arbitrary binary type set aside to designate ‘string’ data, or as indicated with {gExtn} above.

The binary type {gExtn} is then declared as normal, and a record-type id derived, which is by definition (the protocol is self-referential for binary types) the record-id of the record in which the binary type {gExtn} is stored.

This concept is illustrated in FIG. 3 to which reference should now be made. FIG. 3 resembles FIG. 1 except that a binary type has been declared to indicate an extension record.

It will be appreciated that the root UUID {gUuid} and the extension type {gExtn} are the closest candidates to being ‘standard’ types which occur in the protocol, in the sense that they are commonly used, and by their usage in conjunction, arbitrary data of any length can be stored in an otherwise fixed-record-length protocol.

Since the {gUuid} and {gExtn} types are as arbitrary as any other in the protocol, it will be appreciated that any reading or writing process or engine may be considered tuned or sensitive to a particular root and/or extension type. It will therefore be advantageous for such fundamental types to be registered as a standard externally for common appreciation and usage. Their precise identification however is not a pre-requisite of the protocol prior to that time, as the essential nature, facility and benefits of the protocol will be evident irrespective of the final choice of such identifiers.

As such and with the {gUuid} and {gExtn} identifiers recognised and in place, any reading and writing process preferably therefore has code that tells it how to respond if a record of the extension data type is found. This is straight-forward however, as the extension record binary type is used merely to indicate that the current record is an extension of the record immediately preceding it. Thus the concatenated set of data segments from the contiguous series of data records (initial record of non-{gExtn} type followed by a plurality of records of {gExtn} type) constitute a final single data item of arbitrary length, as originally submitted by a client application to the data store. Despite being a standard type, in the sense of common usage, it is pertinent to note that it is only recommended for ease of data storage, rather than required, and that in accordance with the other features of the protocol requires no special codes or characters. Thus a message comprising data consistently of length within the capacity of the data-segment of a single record may omit the {gExtn} declaration. It is nevertheless still desirable in practice to declare it, in order to confirm to the receiving reader that this is in fact the known and recognised {gExtn} type in use.
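The concatenation rule for extension records might be sketched as follows, again for the 4×20 gauge. The `Store` class and its methods are hypothetical, the GUIDs chosen for {gExtn} and {gString} are invented for the example, and the sketch assumes 1-based record ids and little-endian references; only the root GUID is the example value from section 12.

```python
import struct
import uuid

REF_SIZE, REC_LEN = 4, 20
DATA_LEN = REC_LEN - REF_SIZE

G_UUID = uuid.UUID("B79F76DD-C835-4568-9FA9-B13A6C596B93")    # example root from section 12
G_EXTN = uuid.UUID("aaaaaaaa-0000-4000-8000-000000000001")    # hypothetical {gExtn}
G_STRING = uuid.UUID("aaaaaaaa-0000-4000-8000-000000000002")  # hypothetical {gString}

class Store:
    def __init__(self) -> None:
        self.buf = bytearray()

    def count(self) -> int:
        return len(self.buf) // REC_LEN

    def append(self, type_ref: int, chunk: bytes) -> int:
        """Append one zero-padded record and return its 1-based record id."""
        self.buf += struct.pack("<i", type_ref) + chunk.ljust(DATA_LEN, b"\x00")
        return self.count()

    def get(self, record_id: int) -> tuple[int, bytes]:
        start = (record_id - 1) * REC_LEN
        (type_ref,) = struct.unpack_from("<i", self.buf, start)
        return type_ref, bytes(self.buf[start + REF_SIZE:start + REC_LEN])

    def write_item(self, type_ref: int, data: bytes, extn_ref: int) -> int:
        """First chunk carries the item's own type; later chunks are {gExtn} records."""
        chunks = [data[i:i + DATA_LEN] for i in range(0, len(data), DATA_LEN)] or [b""]
        first = self.append(type_ref, chunks[0])
        for chunk in chunks[1:]:
            self.append(extn_ref, chunk)
        return first

    def read_item(self, record_id: int, extn_ref: int) -> tuple[int, bytes]:
        """Concatenate the record with every extension record that directly follows it."""
        type_ref, data = self.get(record_id)
        nxt = record_id + 1
        while nxt <= self.count() and self.get(nxt)[0] == extn_ref:
            data += self.get(nxt)[1]
            nxt += 1
        return type_ref, data

store = Store()
ROOT = store.append(1, G_UUID.bytes)          # record 1 points to itself
EXTN = store.append(ROOT, G_EXTN.bytes)       # record 2 declares the extension type
STRING = store.append(ROOT, G_STRING.bytes)   # record 3 declares a string type
text = b"A string that is longer than sixteen bytes"
item = store.write_item(STRING, text, EXTN)
type_ref, raw = store.read_item(item, EXTN)
```

The extension records need no special codes: each is an ordinary record whose binary type reference happens to be the record id of the {gExtn} declaration.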

In the Figure, record 4 is used to store the extension binary type. As noted above, the data in the record will be a UUID representing that type for the purposes of the data and data control. Records 5 to 9 contain a user binary data type declaration; and records 10 onwards contain data specified as being of the variously defined binary data types.

28. Scalability—Enlargement by Clustering.

Since the protocol is of fixed record length, with fixed maximum record count as defined by gauge to ensure consistency with the self-referential goal of the protocol, it follows that a single store has a maximum size and storage capacity determined by the guidelines of the protocol and the gauge selected.

At 40 gigabytes approx for a 4×20 gauge file, for example, that may be considerably in excess of any reasonable XML file, and yet it may only represent a fraction of a terabyte RDBMS database. Ideally, we would not want the protocol to be restricted to such an absolute limit. Clearly one solution is simply to partition the data across multiple files.

Since each has a capacity (in 4×20 gauge) of approx. 32 gigabytes data per 40 gigabytes file, it is simply a matter of how many files to use to contain the data you wish to store.

The only item requiring particular attention in such a basic model of separated data files is that a means of distinguishing references from different files be established. Clearly a reference ‘27’ in file A is not, except by extreme coincidence, identical in type or nature to a reference ‘27’ in file B.

In practical embodiments we commonly use a GUID as a ‘Source’ Identity in conjunction with each reference, thus ensuring that references from different sources are not inadvertently commingled or used out of context (of their particular file).
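A minimal sketch of such a source-qualified reference is given below. The type name `QualifiedRef` and the two file identities are hypothetical; the description states only that a GUID is used as a ‘Source’ identity alongside each reference.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class QualifiedRef:
    source: uuid.UUID   # identity of the file the reference belongs to
    record_id: int      # record id, meaningful only within that file

# Hypothetical identities for two separate files.
FILE_A = uuid.UUID("00000000-0000-4000-8000-00000000000a")
FILE_B = uuid.UUID("00000000-0000-4000-8000-00000000000b")

ref_a = QualifiedRef(FILE_A, 27)
ref_b = QualifiedRef(FILE_B, 27)

# The same record number from different sources is never confused.
same = ref_a == ref_b           # False
index = {ref_a: "from file A", ref_b: "from file B"}
```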

A complex, sophisticated clustering routine can of course be implemented, but the simple observation is that one file being full does not limit the final effective size of the data store. Clustering is a recognised technique in RDBMS, and in web farms.

While we do not intend to outline a full clustering algorithm here, we can at least indicate that at its simplest, the means to expand a virtual data store capacity is simply to add a new file.

Identities are (if the protocol's recommendations have been followed) based on GUIDs, so simply put, the sum of the information across all files is the sum of the information for that GUID in each file.

29. Scalability—Selecting a Larger Gauge, Databytes.

As noted above, the 4×20 gauge is useful because it results in a practical file size capacity, and a common refsize (int32), and because the 16 data bytes within the 4×20 gauge conveniently allow us to store a single GUID in exactly the data comprising a single record (aka a singleton record, or singleton).

However the true scaleability of the protocol comes from promoting to a larger refsize (reference size, by which we identify the binary type). We have not fully explored why the protocol is useful, and how to use it, from a referential perspective (internal to the data, not simply with regard to binary type), but if we allow for the moment that 2 billion records simply might not be ‘enough’, and it is desired not to split across multiple files, then moving to for example an int64, we would have approx 9 billion billion possible records.

With a gauge of 8×24 therefore (in the same refsize×record length notation as 4×20), with 8 byte (int64) refsize and maintaining a 16 byte datablock per record, the maximum file size would be approx 9 billion billion×24 bytes, or in excess of 200 billion gigabytes; with a data capacity per file approaching 150 billion gigabytes. This is more than enough for a single data file/document for the foreseeable future. If however need arises, by the same mechanism it is a simple matter to expand the gauge by any preferred amount to encompass the requisite scope.

30. References: a Latent Operating System

The entire discussion to date has been focused on examining and outlining very carefully the design decisions, consequences, and usage of what might otherwise appear to be a simple protocol.

Since it is not necessary to understand why references are useful beyond their usage for the declaration of binary types, we have not entered into a discussion of the merits of a referential system beyond that required to explain the binary-type allocation, and in passing, to note in our example diagrams the usage of Triples, declared also in referential manner, by means of record ID's as references within a further data record of type {gTriple}.

However, the example described here is intended, as well as being able to absorb information of an arbitrary nature, to be part of a system providing an automated and well-defined source of data in like manner. For such usage, an appreciation of references will be critical.

It will also be apparent that any system capable of operating with distinction between value-based data objects and reference-based data objects approaches the preserve of a traditional ‘operating system’. If such an operating system may be considered to be a set of memory across which data and referential integrity are maintained for a set of well-defined operations, primarily storage and retrieval, then this protocol constitutes in large part the means to provide the base referential storage for such an ‘operating system’. It may thus be considered to be the substrate by which, by addition of a set of ‘operating’ procedures, a true ‘operating system’ may be implemented, as understood in the art.

That the protocol may be implemented as a memory map clearly identifies it as a candidate therefore for at least an embedded and structured storage embodiment for a chip or otherwise dedicated processing device or medium; and by supplementing the referential store with appropriate operating procedures, a true ‘operating system’ may likewise be implemented on an arbitrary device, store, or medium.

Thus, far from being ‘simply another’ file protocol, the cleanliness, rigidity and simplicity of the protocol lend it to strict, dedicated and high-performance applications, and make it a nascent candidate for a data-focused operating system to sit alongside the two dominant and popular kernel (chip-focused) operating systems of Unix and DOS/Windows.

31. Summary of Characteristics:

The resulting protocol is extremely simple and effective. Understanding why it must be that way has been, step by step, a longer process. To summarise, therefore, embodiments of the disclosed system possess one or more of the following characteristics:

a) binary type identifiers (which in the preferred example are GUIDs) for data should be declared locally in the file as records;
b) records containing user data should have a reference to a record within the file defining the binary type identifier (preferably GUIDs);
c) the remaining bytes (typically following the binary type reference) should be data;
d) the binary type records should in preference be declared ahead of (that is, with a lower record id than, though this does not strictly matter) the data records they describe;
e) a file should contain a root binary type record (in the example system a GUID), and a record defining a binary type should itself point to the root record, since the binary type identifier in the preferred embodiment is an arbitrary instance of itself (by preference a GUID representing GUIDs);
f) the root record is self-referential;
g) an ‘extension’ binary type allows the system to absorb data of any length;
h) records are of identical fixed length throughout the file and the protocol, and begin at byte zero, so that they can be referenced without the need for special keywords/identifiers.
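Several of these characteristics lend themselves to a mechanical check. The sketch below is illustrative only: `validate` and `rec` are hypothetical names, a zero reference is assumed to mark an untyped record, record ids are assumed to start at 1, and the check covers structure and references only (it does not inspect whether the data bytes of a declaring record form a valid GUID).

```python
import struct

REF_SIZE, REC_LEN = 4, 20   # the 4x20 gauge

def rec(type_ref: int, data: bytes = b"") -> bytes:
    """Helper for the examples: one zero-padded 4x20 record."""
    return struct.pack("<i", type_ref) + data.ljust(REC_LEN - REF_SIZE, b"\x00")

def validate(buf: bytes) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    if len(buf) % REC_LEN != 0:
        return ["file is not a whole number of fixed-length records"]     # (h)
    count = len(buf) // REC_LEN
    refs = [struct.unpack_from("<i", buf, i * REC_LEN)[0] for i in range(count)]
    # A root is a record whose binary type reference is its own record id (f).
    roots = {rid for rid, ref in enumerate(refs, start=1) if ref == rid}
    problems = []
    if not roots:
        problems.append("no self-referential root record")                # (e)
    for rid, ref in enumerate(refs, start=1):
        if ref == 0 or rid in roots:
            continue                        # untyped records are tolerated
        if not 1 <= ref <= count:
            problems.append(f"record {rid}: type reference {ref} is outside the file")  # (a)
        elif ref not in roots and refs[ref - 1] not in roots:
            # The referenced record must itself be typed by a root (b), (e).
            problems.append(f"record {rid}: record {ref} does not declare a binary type")
    return problems

good = rec(1, b"root") + rec(1, b"type") + rec(2, b"data")
print(validate(good))   # []
```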

Although the discussion of why each of these characteristics has been chosen is lengthy, the final result is a simple gauge, a clearly defined file structure, and a self-referential algorithm, with GUIDs as preferred identifiers.

The features individually, or together, may appear to be a trivial combination no different from other possibilities. That this is not true is most easily appreciated if the reader should consider which other protocols allow:

a) automatic reading for structure (proprietary RDBMS, for example, do not: an installed and proprietary SQL interpreter is required, rather than direct examination of the underlying data file);
b) arbitrary and spontaneous declaration of data of arbitrary and spontaneous binary type, being nevertheless well defined;
c) and which are automatically readable for such identifiers and such data.

It should be appreciated therefore that the protocol characteristics have been chosen as contributions to embodiments of a truly general file format, capable of arbitrary contribution by anonymous third parties, nevertheless with the assurance that data of any type and nature (if supplied with an appropriate binary type GUID) can be safely and reliably stored.

Furthermore the resultant binary data file can be reliably identified without further installed readers or proprietary software beyond that necessary to follow the few clearly defined and simple rules described herein.

The end result is crucial not simply for what is present, and for the capabilities provided, but also for what is absent, and for what pitfalls have been avoided. This prevents the protocol from being yet another ambiguous and limited storage or messaging medium.

The example system therefore provides a data storage protocol that will be flexible, durable, and support automated absorption, a facility unique to our knowledge among all extant file formats and protocols, and absolutely and certainly impossible with the most popular protocols, XML and RDBMS.

RDBMS and similar ‘data’ systems for example rely on proprietary file structures for performance, which are not readily dissected or understood, and which require intermediary parsers for access.

XML for example is not a ‘natural’ referential system and must be parsed sequentially into its constituent elements according to the markup characters in order to determine a final hierarchical document within which further structure and references may be discerned.

By eschewing markup and by relying on fixed length records, the current embodiment allows a reading application to jump from a reference in one record to an immediately and well-defined offset in the file comprising the target of that reference, by means of a simple arithmetical calculation.
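By way of illustration only, the arithmetical calculation may be sketched as follows. The sketch assumes the 4×20 gauge used in the examples, that record IDs start at 1, and that record 1 begins at byte zero; the name record_offset is hypothetical and is not part of the protocol.

```python
REF_SIZE = 4      # bytes holding the type reference (4x20 gauge)
RECORD_LEN = 20   # total bytes per record (4x20 gauge)


def record_offset(record_id: int, record_len: int = RECORD_LEN) -> int:
    """Byte offset of a record, assuming 1-based IDs and record 1 at byte zero."""
    if record_id < 1:
        raise ValueError("record IDs are positive integers")
    return (record_id - 1) * record_len


# A reference to record 6 resolves directly to a byte offset,
# with no markup to parse on the way.
offset_of_record_6 = record_offset(6)
```

No index or delimiter scan is involved: the reference value alone determines where the target record lies.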

This enables the preferred embodiment to act as both messaging protocol (akin to typical use of XML, for ‘small’ documents/data stores), and as a fully expressed and indexed data store akin to an RDBMS at the other extreme, both with the same transparent and well-defined protocol.

The example system therefore has been carefully thought out to provide a data storage protocol that will be flexible, durable, and as indicated may support both low-key messaging akin to XML and high-mass, indexed data stores, akin to RDBMS.

Furthermore, it will support automated absorption, a facility unique to our knowledge among all extant file formats and protocols, and one that is certainly and absolutely impossible in the common usage of the most popular protocols, XML and RDBMS.

The proof and demonstration of such a facility will be the subject of a later application, being that of Fluid Data.

Having described exemplary features of the protocol, its operation and implementation will now be discussed in more detail.

It will be appreciated from the above that data should not ever be simply ‘written en bloc’ to disk, disregarding the type protocol, and simply writing eg: 150 data bytes in sequence, without any intervening {gExtn} identifiers (in the 4×20 gauge). It is a design principle, absolute and strict, that a 3rd party reader should be able to iterate through the file from record ID 1 to the last record ID, and request the binary type identifier (as a ref) and thence the binary type identifier (preferably a UUID) defining the binary type. They may then read or act upon such information as appropriate.

If data is written ‘en bloc’, disregarding the protocol, then the first four bytes of the record following the first user record will NOT represent a self-referential type, but random data (according to that input).

If the reading algorithm is fortunate, the incorrect type data so obtained will point to a non-GUID, or inappropriate type value, so indicating probable corruption (certain, in this case); if not, and it points to a record that happens to contain a GUID, worse still a recognised type GUID, then an entirely incorrect inference will be drawn, without obvious error until subsequent actions and corruption have followed.

The use of the example storage protocol will now be explained in more detail with respect to a computer system framework.

FIG. 4 illustrates a memory map of a storage device 20, on which data according to the example protocol is stored. The storage device has a memory in which a file 22 has been created. The file 22 contains a first record 24 and a last record 26.

The unused (usable) space on the device is illustrated by region 28. This could be used merely by making the file in which the data is stored larger. The limit to storage within a single data store is then decided by whichever is smaller: the remaining protocol capacity or the remaining device capacity. If the remaining device capacity is less than the remaining protocol capacity, then a region, here region 30, will be theoretically valid in the protocol, but inaccessible, since no device capacity remains to implement it.

As discussed above the protocol capacity is limited by the gauge, and specifically the number of bytes allowed to specify the record reference to binary type. In this example, the usable device capacity is less than that of the protocol, resulting in region 30.

If on the other hand, the device is large enough to encompass the full remaining protocol, then it is the protocol that will limit the single store capacity, as references to records beyond the protocol's last record ID will return errors, if the protocol is correctly implemented. This is a safety measure to ensure that a file created consistent with the protocol will always be readable by another algorithm coded consistently with the protocol. Region 32 illustrates unusable device capacity outside of the protocol.
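The interplay between protocol capacity and device capacity can be sketched numerically. The following is a minimal illustration, assuming that the reference is stored as a signed integer of refsize bytes (as in the 4×20 example) so that the largest usable record ID is 2^(8×refsize−1) − 1; the function names are hypothetical.

```python
def max_record_id(ref_size: int) -> int:
    """Largest record ID expressible in a signed integer of ref_size bytes."""
    return 2 ** (8 * ref_size - 1) - 1


def store_capacity_records(ref_size: int, record_len: int, device_bytes: int) -> int:
    """Records a single store can hold: the smaller of the two limits."""
    protocol_limit = max_record_id(ref_size)
    device_limit = device_bytes // record_len   # whole records only
    return min(protocol_limit, device_limit)


# Small device, 4x20 gauge: the device is the limit (the region 30 case).
small_device = store_capacity_records(4, 20, 1_000)
# Large device, hypothetical 2-byte reference: the protocol is the limit
# (the region 32 case).
narrow_gauge = store_capacity_records(2, 20, 10**9)
```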

FIGS. 5 and 6 illustrate how the data protocol could be used in a wider system. FIG. 5 illustrates application 34 for reading and writing data according to the protocol described above to and from a device 20. Device 20 may be any suitable storage device or medium, such as internal memory, memory provided on a network, a hard disk, or portable memory device.

The application 34 is shown as having a front end 36 for providing a graphical user interface for a user to enter and view data. The application 34 also includes back end application 38 for handling the writing and reading of data to the data store 20. Back end application 38 has a “read data” control element or process 40 and a “write data” control element or process 42. It will be appreciated that although the front and back end applications and read and write processes are shown as separate components they could be provided as a single monolithic application or as separate modules.

Read and write processes encode the protocol discussed above, such that when data is written to or read from the store 20 the protocol is obeyed. During the reading and writing process, an encoding list or index 44 is preferably consulted to ensure that the binary data in the store 20 is interpreted correctly in terms of its type.

The encoding list or index 44 may be provided in memory on the same computer or server housing the application 34, or may be accessible across a network.

In the example discussed so far, it has been assumed that a single application accesses a single data store, whether remote or local. However, the advantages provided by the data protocol will be more apparent when it is used on a network involving a number of different computers and data stores. This case is illustrated in FIG. 6.

FIG. 6 shows a plurality of front end applications 36, which may be provided on the same or different personal computers. The front end applications communicate with back end applications 38 located on one or more servers accessible via a network. The back end applications have read and write processes 40 and 42 as before.

A plurality of data stores 20 are also illustrated. These may be provided on separate servers, personal computers, or other storage resources available across a network.

As shown in FIG. 6, particular back end applications 38 may provide access to different data stores, allowing the user via a front end application to request one of several locations where the data is to be written or from where it may be read. As with FIG. 5, each of the read and write processes utilises the encoding list or index 44 in order to interpret the data types stored in the data files.

Reading and Writing

Reference will now be made again to FIG. 3, to illustrate in more detail the operations of reading and writing a file according to the preferred protocol, described above.

The example file shown in FIG. 3 contains data that stores an identifier for ‘London’, and a description of London, as a string. The complexity may seem burdensome for such a simple item, but the consequences of remaining strictly within the protocol and embodying the data in this manner are that a simple, strict computer algorithm can accept and process this file without human intervention, while retaining accurate binary and structural integrity.

The example file comprises 22 records, diagrammatically divided into three sections 12, 14 and 16 for the purpose of understanding typical usage and roles. No such ‘sectional’ view is implicit or required by the protocol itself.

The first section 12 contains typical critical records, such as leading flags in records 1 and 2, that is signals that may be used to indicate a file's compliance with a particular reader/writer engine; a root UUID declaration {gUUID} in record 3 (the GUID declaring the ‘GUID’ binary type), which is self-referential; and an extension type {gExtn} in record 4. The extension type {gExtn} is declared as a GUID, by binary type identifier ‘3’, indicating that it is of type {gUUID}. The contents are deemed to be the identifier for an ‘extension’ record, as noted earlier.

Without a {gUUID} declaration, there is no root, and so no effective protocol. Without {gExtn}, records are restricted to singleton records, and data per record to a fixed, gauge dependent width, here 16 bytes. The file is deemed to be a typical 4×20 file, refsize 4 bytes, 20 bytes record length, whence the TypeID is 4 bytes, and the DataBytes is 16 bytes in length.

The second section 14 comprises typical common declarations for data types. A final application or file may have many more of these. Also, there is no requirement that they be all declared at file-inception. In certain embodiments, novel types can be declared at any time. The diagram illustrates five user-defined data types: Triple (record 5), String (record 6), Agent (record 7), Name (record 8) and WorldType (record 9).

The final section of the file 16, for discursive purposes, is the client data, which is where the final items of interest and their relations are noted. The use of types to describe data will now be discussed in more detail.

Of the example types defined in the common section 14, ‘{gString}’, for a string type declaration (itself of type 3: {gUUID}), may perhaps be the only self-evident one. Data according to type ‘String’ is stored in records 16 to 21 for example. Note that records 16 to 20 contain the phrase “London is one of the world's leading cities, and capital to the UK”. This phrase is large enough to require storage in five records, all of which except the first are typed {gExtn} to show that they logically relate to the preceding record.

We will briefly describe the other common types, so that the reader may get a sense of how we regard and structure data:

{gTriple}: is a Triple, as defined in GB 2,368,929 (US Patent application 2005/0055363A), which allows declarations of the form: [subject].[relation].[object]. It obviates the need for schema declarations in databases and XML, and so supports spontaneous data contribution, transfer, and absorption between data stores without human intervention, at the structured data level. In the current example, three triples are declared, in records 12, 15, and 22:
1) {gLondon}.{gName}. “London”
2) {gDescription}.{gName}. “Description”
3) {gLondon}.{gDescription}. “London is one of the world's leading cities, and capital to the UK”

The approximate RDBMS equivalent of these triples is illustrated in the ‘pseudo-tables’ in FIG. 7. It is beyond the scope of this application to describe the equivalence and differences here, but the diagram may help the reader assemble the elements of the illustrated file more easily into a rational whole.

The other identifiers declared in the ‘common’ section (designated such for this discussion only) are:

{gString}: used for storing string types.
{gAgent}: a common type beyond the scope of this embodiment.
{gName}: used to declare an (English) name for a binary (GUID) identity.
{gWorldType}: provides classification, typically via a triple, since the protocol does not need nor provide tables, with their explicit and restrictive classifications.

The example could declare {gLondon}.{gWorldType}.{gCity} for example, but in the interests of brevity we have restricted the example to simply declaring a description for London.

It will be noted that {gString}, {gTriple} (also {gAgent}) and obviously {gUUID} all declare well-defined binary types. (Strictly, string is subject to encoding, and we use UTF8 in a typical embodiment). {gExtn} is a particular ‘binary type’ allowing continuation of binary types.

By contrast, {gName}, {gWorldType}, {gLondon}, {gDescription} are all conceptual types. There is no intended interpretation of 1's and 0's for the concept of ‘classification’ ({gWorldType}). It is simply an identifier for a concept, whereby we can ‘classify’ things, or likewise ‘name’ them, or ‘describe’ them.

The instance data (in for example triples) will have an explicit binary type (typically a string for a ‘name’, and a ‘GUID’ for an identifier), but that binary type belongs to the instance, not (as is implemented in RDBMS) to the field or relation, or concept itself.

The use of such identifiers is common in the art, and recognised in RDBMS, so we will not expand further here, except to note their declaration in the example, and their usage (here, in triples).

Note also that we have not included the (English) names for these declarations, for brevity, which we could otherwise have declared using triples and {gName}, as we have done for {gLondon} and {gDescription}.

By operating with GUID identifiers, we become language independent for data, as far as the computer is concerned, though users will still need locally interpreted language. We simply note here the mechanism for such declarations.

We restrict ourselves to triples here, for structured relations, but any binary bespoke type could be equally well created. To illustrate reading and writing such files, this example will suffice.

The absolute primitives upon which all other operations are based are ReadSingleton and WriteSingleton, as illustrated in FIGS. 8 and 9.

We have stripped out the ‘Seek’ element, which will be covered in the Read Record and Write Record Operations described later. Here we simply note that the action of reading a singleton is to read refsize bytes, where refsize is that determined by the gauge of the file, typically 4 bytes as a signed integer.

Thereafter the reader reads the remaining databytes bytes, where databytes is the other element in the gauge. The first four bytes above constitute the Binary Type Identifier, and these latter 16 bytes the ‘client data’.

Since the file is self-referential, the TypeID (the first four bytes as a reference to a record within this file), will be to a valid RecordID (integer >0, and <= the number of records within the file). In a typical and well-defined file in the preferred embodiment, the TypeID will point to (be a record ID reference for) a record, which will itself be a GUID declaring the binary type of the client record.

To know what binary type our client data is, we read the GUID of the referenced record, whose own TypeID, being a GUID, should be that of the root {gUUID} declaration.

Thus, if it is not, we do not have an anticipated GUID, and as such we do not have as we expected a well-defined file. Thus, the protocol is strict, and it is readily determinable if it appears to have been adhered to, in that regard.

Thus in the example, “London”, the string, in record 11, is declared as type 6, which references record 6, {gString}, whose own type is type 3, or {gUUID}, as expected, indicating that record 6 is indeed a GUID and we can read its data and so derive the {gString} GUID, which tells us the type of record 11, as we desire.
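The chain of look-ups just described may be sketched as follows. This is an illustrative sketch only, under stated assumptions: the 4×20 gauge, a little-endian signed TypeID (byte order is not specified above), 1-based record IDs, and a stand-in file in which the records not discussed here are filled with placeholder GUID declarations. The helper names make_record, read_singleton and binary_type_of are hypothetical, and the GUID values are randomly generated rather than those of any real file.

```python
import struct
import uuid

REF_SIZE, DATA_BYTES = 4, 16            # the 4x20 gauge
RECORD_LEN = REF_SIZE + DATA_BYTES


def make_record(type_id: int, data: bytes) -> bytes:
    """Pack one singleton, zero-padding the data to DATA_BYTES."""
    if len(data) > DATA_BYTES:
        raise ValueError("singleton data exceeds the gauge")
    return struct.pack("<i", type_id) + data.ljust(DATA_BYTES, b"\x00")


def read_singleton(store: bytes, record_id: int):
    """Return (TypeID, data bytes) for a 1-based record ID."""
    n_records = len(store) // RECORD_LEN
    if not 0 < record_id <= n_records:
        raise ValueError(f"record ID {record_id} is outside the file")
    start = (record_id - 1) * RECORD_LEN
    (type_id,) = struct.unpack_from("<i", store, start)
    return type_id, store[start + REF_SIZE:start + RECORD_LEN]


def binary_type_of(store: bytes, record_id: int) -> uuid.UUID:
    """Follow a record's TypeID to its declaring GUID, checking the root."""
    type_id, _ = read_singleton(store, record_id)
    decl_type, guid_bytes = read_singleton(store, type_id)
    root_type, _ = read_singleton(store, decl_type)
    if root_type != decl_type:
        raise ValueError("type record is not typed by a self-referential root")
    return uuid.UUID(bytes=guid_bytes)


# Stand-in for the first eleven records of the example file (placeholders
# everywhere except the three records discussed in the text).
G_UUID, G_STRING = uuid.uuid4(), uuid.uuid4()
records = [make_record(3, uuid.uuid4().bytes) for _ in range(11)]
records[2] = make_record(3, G_UUID.bytes)      # record 3: root, self-referential
records[5] = make_record(3, G_STRING.bytes)    # record 6: {gString}
records[10] = make_record(6, b"London")        # record 11: the string "London"
store = b"".join(records)

type_of_london = binary_type_of(store, 11)     # resolves to the {gString} GUID
```

A TypeID that falls outside the file, or that points to a record not typed by the root, raises an error rather than yielding a guessed type.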

In practice, this apparently long-winded approach occurs only once per common type, as once the {gString} record has been accessed once, it can be stored in memory so that we simply map the ‘string’ type to ‘TypeID 6’, (in this file), or as required in other files, so that we achieve nearly the same performance as for hard-coded binary types, but while retaining flexibility and independence as to binary type.

Writing a singleton occurs similarly, by writing its appropriate TypeID (record ID for the record in which the binary type GUID is declared) and the associated data, bearing in mind that for a singleton, the data cannot exceed databytes bytes in length, in this example 16.

The one subtlety of a WriteSingleton request is that it must be ensured, if the write occurs at the end of the file, that all databytes bytes are written, else the file will no longer have integral length with respect to records, thus the write remainder bytes step in FIG. 9 ensures that zeros are written to the file to ensure a consistent record size.
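A minimal sketch of such a write, including the remainder step, is given below. It assumes the 4×20 gauge and a little-endian signed TypeID (byte order is an assumption here); write_singleton is a hypothetical name, and an in-memory stream stands in for the file.

```python
import io
import struct

REF_SIZE, DATA_BYTES = 4, 16
RECORD_LEN = REF_SIZE + DATA_BYTES


def write_singleton(f, type_id: int, data: bytes) -> None:
    """Write one record at the current position, always RECORD_LEN bytes."""
    if len(data) > DATA_BYTES:
        raise ValueError("singleton data cannot exceed DATA_BYTES")
    f.write(struct.pack("<i", type_id))
    f.write(data)
    # 'Write remainder bytes': pad with zeros so the file stays integral.
    f.write(b"\x00" * (DATA_BYTES - len(data)))


f = io.BytesIO()
write_singleton(f, 6, b"London")
written = f.getvalue()
```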

In order to make effective use of the file, we first initialise the file, and check that we do indeed have a root declaration, and if appropriate, an extension record. This is illustrated in FIG. 10, which simply acknowledges that before we can do proper work, we must first validate these critical items.

The checks and actions can vary considerably in complexity, but at a minimum:

a) the file should be integral with respect to the presumed gauge
b) lead flags may be present and should be noted
c) a root, self-referential, record for GUID should be present
d) a record for {gExtn} is strongly preferred

Without d), a {gExtn} type, all Read/Write operations are restricted to Singletons, and data of arbitrary length beyond a singleton data length may not be stored. A {gExtn} type may be ‘late’ declared, but this is generally considered inadvisable. Early declaration (shortly or immediately after the {gUUID} declaration) ensures that both reader and writer are using the same {gExtn} identifier; else multi-record data entered with one identifier {gExtn1} may, if the reader assumes a different {gExtn} type ({gExtn2}), be misinterpreted as singleton data, with some ‘unfamiliar’ following singletons of type {gExtn1}. Early declaration of the {gExtn} in use provides reassurance as to the common agreement for the {gExtn} identifier in use.

If it is further desired to validate the file for consistency with respect to eg: Type Declarations (all such binary types in the example are GUIDs), and/or any particular specialist knowledge with respect to flags, that can be done at this time.

A specialist data store with a sophisticated indexing paradigm can use the same protocol, but will want to be assured that it created and so has some control over the higher level structure and indexing, overlaid onto the structure provided by the preferred protocol outlined here. The advantage of the structure is that the file remains readable, no matter how complex, for diagnostic, debugging, and data absorption, extraction and transfer purposes.

Once a file is ‘Ready’ to be read or written to, more formal operations can begin. Ultimately, all operations hinge on low-level Read and Write operations, but given the carefully structured nature of the protocol, we do not advise allowing the user/developer access to a traditional ‘Seek/Read/Write’ methodology.

Although the protocol supports data of arbitrary length, it must first be prepared or ‘striped’ into a buffer that is consistent with the protocol, which process can in principle be understood with reference to FIG. 11.

The steps involved in writing an arbitrary data block are:

In Step 2) Evaluate the records required: the deemed gauge of the file determines the databytes per singleton, so for example, to write 40 bytes, with a 4×20 gauge (with 16 data bytes per record) requires 3 records: 16+16+8=40, with 8 bytes remaining unused in the 3rd record.

The final striped buffer for writing therefore will comprise three records, and since each record comprises 20 bytes (in 4×20 gauge), that means a buffer of 60 bytes.

In Step 4) A buffer therefore of 60 bytes (3×20 bytes) is initialized to zero, into which the data can be ‘striped’.

In Step 6) the first singleton is written to the buffer and comprises the intended TypeID of the overall record (6, in our example, for a {gString}), followed by the first 16 bytes of our data (here: ‘London is one of’).

In Step 8) while there is more data to write, Step 10) writes further singletons to the buffer comprising the {gExtn} TypeID (here 4), and the following 16 bytes of data, until the data is exhausted.

In Step 12) the resultant buffer is now striped into a form that is consistent with the protocol and is ready to be written ‘en bloc’ to the file as required. The process ends at Step 14.
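The striping steps above can be sketched as follows. This is a minimal illustration assuming the 4×20 gauge and a little-endian signed TypeID (byte order is an assumption); the function name stripe is hypothetical, and the TypeIDs 6 and 4 are those of the FIG. 3 example.

```python
import math
import struct

REF_SIZE, DATA_BYTES = 4, 16
RECORD_LEN = REF_SIZE + DATA_BYTES


def stripe(type_id: int, extn_id: int, data: bytes) -> bytes:
    """Lay data out as one typed singleton followed by {gExtn} singletons."""
    # Step 2: evaluate the records required (at least one, even if empty).
    n_records = max(1, math.ceil(len(data) / DATA_BYTES))
    # Step 4: a zero-initialised buffer of whole records.
    buf = bytearray(n_records * RECORD_LEN)
    for i in range(n_records):
        start = i * RECORD_LEN
        chunk = data[i * DATA_BYTES:(i + 1) * DATA_BYTES]
        # Step 6 for the first record, Step 10 for each extension record.
        struct.pack_into("<i", buf, start, type_id if i == 0 else extn_id)
        buf[start + REF_SIZE:start + REF_SIZE + len(chunk)] = chunk
    return bytes(buf)


text = "London is one of the world's leading cities, and capital to the UK"
striped = stripe(6, 4, text.encode("utf-8"))   # five records of 20 bytes each
```

Because the buffer is zero-initialised, the unused tail of the last record is already padded when the loop finishes.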

It will be noted that this process, since it occurs in memory, is considerably faster generally than performing a sequence of individual writes, and less risky than having to coordinate such a sequence in a multi-threaded environment.

Nevertheless, it is simply one illustration of how a record which may possibly require extension records can be handled consistent with the preferred protocol.

As illustrated in FIGS. 12 and 13, writing such buffers now follows the simple Seek/Write model, though in the preferred embodiment the Seek is implicit in the Write method, by asking the client to designate the intended RecordID (FIG. 12) in a call such as bool Write(int RecordID, TypeID rt, byte[ ] baData), or allowing the engine to perform the seek (FIG. 13) by moving to the end of the file in a call to int WriteNew(TypeID rt, byte[ ] baData). In the latter case, the function returns an integer RecordID identifier for the record just written, or 0 or a negative integer for a failure. The write process begins in step 16, with a determination of the readiness of the engine. If not ready, the process exits in step 18.

In a multi-threaded environment in particular a distinction may be made between a writer being not ready by reason of the file being full, the writer being uninitialized, or for corruption or other error (in which case the write fails and exits); and being not ready while waiting for a write-access-permission (in which case the procedure can wait indefinitely or for some timeout, according to implementation).

A ‘Seek to record’ request is made in Step 20, and a query as to whether a valid write position has been obtained in Step 22. If the position is not valid, an error is returned in step 24, and the process exits and waits in step 26. If the position is valid, then the buffer is accessed to prepare the record bytes in step 28, and the bytes written in step 30. A ‘success’ indicator is returned in step 32, whereupon the process exits in step 34.
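A simplified sketch of the two write calls follows. It is an assumption-laden illustration rather than the engine itself: it takes an already striped buffer instead of a TypeID and raw data, uses an in-memory stream as the file, omits the readiness and locking logic, and performs no check against overwriting following records. The names write, write_new and record_count are hypothetical.

```python
import io

RECORD_LEN = 20   # 4x20 gauge


def record_count(f) -> int:
    """Number of whole records currently in the file."""
    return f.seek(0, io.SEEK_END) // RECORD_LEN


def write(f, record_id: int, striped: bytes) -> bool:
    """Seek to a 1-based record ID and write a striped buffer there."""
    if len(striped) == 0 or len(striped) % RECORD_LEN != 0:
        return False                       # buffer is not whole records
    if not 0 < record_id <= record_count(f) + 1:
        return False                       # no valid write position
    f.seek((record_id - 1) * RECORD_LEN)
    f.write(striped)
    return True


def write_new(f, striped: bytes) -> int:
    """Append at the end of the file; return the new RecordID, or 0 on failure."""
    record_id = record_count(f) + 1
    return record_id if write(f, record_id, striped) else 0


f = io.BytesIO()
first = write_new(f, b"\x01" * 20)    # one record
second = write_new(f, b"\x02" * 40)   # a two-record item
third = write_new(f, b"\x03" * 20)
```

The RecordID returned for a multi-record item is that of its first singleton, so the third item above lands at record 4.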

It should be noted that implementations of the disclosed technology preferably implement safety checks such that for example ‘buffer overruns’ are avoided, by which a larger write is subsequently requested over an original data record of smaller capacity. A ‘later’ request to write data requiring 10 singletons over an ‘earlier’ record of say 8 singletons would overwrite two following singleton records, causing probable corruption of the data file except where such overwritten records were carefully and previously identified as ‘spare’.

Such checks and procedures represent responsible coding practice as may be expected to be understood and followed by individuals skilled in the art, and as such are not outlined here beyond intimating and acknowledging their appropriateness, and the protocol's capacity to accommodate them.

The process of declaring a binary type is illustrated in FIG. 14 to which reference should now be made. In order to declare a binary type such as {gString}, the core processes above are used, with the typical addition that the application or engine (36, 38) may preserve a list or index of recognised and common identifiers, for performance reasons, and will seek to ensure that such identifiers are re-used, rather than having new identifications being repeatedly made.

These are preferences however, and according to the intent or specification of the engine or file, it may provide sophisticated indexing, or it may simply allow repeated re-declarations, each with a different identifier. Each is valid and appropriate, and neither violates the protocol, according to need.

The full process for contributing data then is to first declare its type, and thence to declare a record with that TypeID, and the data, per the lower-level functions outlined above. This is schematically illustrated in FIG. 15. As it is up to the user to identify the type for the data, the engine is preferably provided with a look-up facility to search through the list or index of identifiers.

Reading operations are illustrated in FIGS. 16 and 17. FIG. 16 illustrates the operation of a single Extract Record Bytes. FIG. 17 illustrates the actions involved in the read process, including the Extract Record action. Reading data reverses the flow, based on the core Read Singleton operation, which reads a TypeID (integer, 4 bytes in our example gauge), and some data. Because the record may be followed by extension records, a full read requires a loop or algorithm to check subsequent records, and append the data part of each record typed as {gExtn} to a buffer carrying the final data.

Without a ‘length’ field in the core algorithm, there is no magic means of determining the correct and accurate length for such a buffer, but the trade off is modest, given the increase in simplicity, and the avoidance of ambiguity outlined in earlier preamble. The ‘Prepare Buffer’ step in FIG. 16 is slightly simplified therefore, and various modes for its implementation would be apparent to the skilled developer.

Two simple and common approaches may for example be to store a list or collection of the data segments, until the extensions are exhausted, and assemble them finally into a single contiguous data item; or to read in blocks of records (since disks habitually have an efficient ‘sector’ size, typically in excess of the singleton size), and likewise make a list or collection of such blocks, examining each for the termination of extension records, and so finally preparing and extracting the data into a contiguous data object (typically, a byte array or coding object representing a record/data object with its type and data bytes).
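The first of these approaches, collecting data segments until the extensions are exhausted, may be sketched as below. The sketch assumes the 4×20 gauge, a little-endian signed TypeID, and an in-memory byte string standing in for the file; read_record, read_singleton and rec are hypothetical names. Trailing zero padding is returned as stored, since the core protocol carries no length field.

```python
import struct

REF_SIZE, DATA_BYTES = 4, 16
RECORD_LEN = REF_SIZE + DATA_BYTES


def read_singleton(store: bytes, record_id: int):
    """Return (TypeID, data bytes) for a 1-based record ID."""
    start = (record_id - 1) * RECORD_LEN
    (type_id,) = struct.unpack_from("<i", store, start)
    return type_id, store[start + REF_SIZE:start + RECORD_LEN]


def read_record(store: bytes, record_id: int, extn_id: int):
    """Read a record plus any {gExtn} records that immediately follow it."""
    n_records = len(store) // RECORD_LEN
    if not 0 < record_id <= n_records:
        raise ValueError("record ID is outside the file")
    type_id, data = read_singleton(store, record_id)
    parts = [data]                          # list of data segments
    next_id = record_id + 1
    while next_id <= n_records:
        next_type, next_data = read_singleton(store, next_id)
        if next_type != extn_id:
            break                           # extensions exhausted
        parts.append(next_data)
        next_id += 1
    return type_id, b"".join(parts)         # one contiguous data item


def rec(type_id: int, data: bytes) -> bytes:
    return struct.pack("<i", type_id) + data.ljust(DATA_BYTES, b"\x00")


store = rec(6, b"London is one of") + rec(4, b" the world") + rec(6, b"next")
type_id, payload = read_record(store, 1, extn_id=4)
```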

The Read Record algorithm requires a ‘seek’ to the appropriate record, and thence an Extract Record Bytes operation as outlined in FIG. 16. Depending on the intent and nature of the operation, it may be sufficient to return simply the TypeID in place of the binary type GUID, since if the end client algorithm wishes to validate or determine the GUID they can do so simply and directly by repeating the Read algorithm on the TypeID itself. In practice, typical reading embodiments will hold common TypeIDs in memory, obviating the need for such a step, or allowing rapid assignment and determination of the associated GUID if required. All other operations, as must be for any low level protocol, ultimately hinge on these critical operations for read and write, and given the nature of the protocol, it is well advised that they not only be carefully structured in practice to ensure that errors are handled benignly, without corrupting the underlying data, but also that ultra-low-level file operations (seek, read and write of raw bytes, unstriped, and randomly within the file) are permitted only under the most controlled of circumstances.

In practice, such operations are likely to be entirely prohibited, given their risk (especially writing to a ‘random’ location within the file), in a ‘normal’ engine, though they may have some merit in a diagnostic engine. In practice again, however, even there, the simple and well-defined structure of the protocol makes it far more effective and clear for diagnostics if the diagnostic-reader is also tuned to the intended gauge, using the RecordID=TypeID+Data pattern.

The overhead of data striping for extension records is a small price to pay for clear and strict adherence to the protocol. With extension records in place, the protocol can truly be said to support storage of any type, of any length, subject only to the remaining capacity on the device, and in the protocol. (The protocol being limited by design to a maximum unambiguous reference id).

It will be appreciated that the example data protocol provides a truly general data storage facility of well-defined but indiscriminate (not identified for knowledge-structure) data that may be advantageously used in combination with the truly general data structuring facility that is the subject of GB 2,368,929 (pending US patent 2005/0055363A1), which offers the minimal solution to declaring external, or explicitly structured data (akin to that in a relational database, but more publicly accessible, and open).

The separation between the roles of advertisement of knowledge-structure (as typified by schemas and storage systems that rely on such, such as XML and RDBMS) and the accurate storage and identification of binary objects (of arbitrary or indiscriminate structure) is by design.

The biggest obstacle in the automated assimilation of data is the inappropriate use of binary (indiscriminate) identifiers to encapsulate non-binary (human-knowledge) structures. This forces an interpreting algorithm to become familiar with the ‘concept’ behind the binary identifier, which, since human concepts are intrinsically arbitrary and subjective, means that a file may only in practice be read by someone who either designed the original file or schema, or who has examined the file or schema and believes that they understand it (by which token it is also apparent that it must have been written in a manner and language understandable by the intended user).

This places an extremely high human dependency on the reading process, and would therefore be untenable in a system for universal and automated means of data exchange and absorption. For this reason, in the preferred embodiment the interpretation of the binary data for computer (absorption) purposes is free of any such ‘human’ knowledge dependencies.

This is a key distinction between embodiments of the current disclosed protocol and those such as XML and RDBMS, with their high human-knowledge dependencies woven into the binary nature of the storage representations, which preclude their absorption into further, typically larger, binary stores by a simple automated process.

While the protocol is strict with respect to identification and structure of its basic interpretation (records with self-referential binary-type identification, preferably via GUID), it makes no presumption as to the ‘human’ knowledge aspects of the data, and as such is freed from human-dependency for sharing and absorption, while retaining the potential for higher-level knowledge encapsulation, via mechanisms such as Triples or other custom knowledge-encapsulating data types.

The preferred protocol is strict in allowing similar facilities to RDBMS (with suitable higher level modules), and so applications for use with the protocol should implement suitably rigorous algorithms out of respect for the integrity of the data already stored. That the preferred protocol allows unparalleled freedom to contribute data spontaneously and on the fly, even if of entirely novel type or structure, follows from the design and principles outlined herein. Beyond the freedom to contribute lies the freedom to share, export or merge.

1. In a computing device that implements a multiple-binary-type data storage mechanism, a method comprising: with the computing device that implements the multiple-binary-type data storage mechanism, writing a plurality of records to a data structure; and with the computing device that implements the multiple-binary-type data storage mechanism, storing the data structure in a storage medium, wherein each record has the same length in bytes, each record using a predetermined number of bytes to store a reference to a binary type, the reference indicating the binary type of data in the record, and using the remaining bytes to store data in the record, wherein records having different lengths in bytes are not permitted in the data structure, wherein the reference to a binary type is a reference to a record that serves as an identifier for a binary type; and wherein the writing act comprises: a) writing a root record serving as an identifier for a root binary type, wherein the reference in the root record is self-referential, and points to the root record; b) writing at least one record serving as an identifier for at least one binary type of input data that is to be stored in the data structure, wherein the reference of the at least one record points to the root record; and c) writing, in cases when the input data can be stored in a single record, a record to store the data, wherein the type reference of the record points to a record defined in b) identifying the binary type for that record.
2. The method of claim 1, wherein the writing act further comprises: d) writing a record serving as an identifier for an extension binary type, wherein the reference of the record points to the root record; the extension binary type indicating that the data in the record has overflowed from the previous record; and e) writing, in cases when the input data is too large to be stored in a single record, a first record to store the data, wherein the reference of the first record points to a record defined in step b) identifying the binary type for that record; and writing as many subsequent records as are necessary to store the remainder of the data, wherein the reference of the subsequent records points to the record identifying the extension binary type defined in step d).
3. The method of claim 1, wherein the records are written to the data structure such that no special characters appear in the written data.
4. The method of claim 1, wherein writing records begins at a cardinal offset of the logical data structure such that records can be identified by ordinal index and positioned by means of that ordinal index.
5. The method of claim 4, wherein the cardinal offset is zero.
6. The method of claim 1, wherein the records are written such that, apart from the type references used within records, no explicit record identifiers appear in the data structure.
7. The method of claim 6, wherein the reference to another record is a number indicating the position of that record within the data structure.
8. The method of claim 7, wherein the reference number is a positive integer.
9. The method of claim 1, wherein each record comprises only the predetermined number of bytes for storing the reference to a record serving as an indication of the record's binary type, and the bytes for storing user data.
10. The method of claim 9, wherein the reference to another record is stored in the leading bytes of the record.
11. The method of claim 9, wherein references to records can be embedded within the user data segments of records.
12. The method of claim 1, wherein the record serving as an identifier for the root data format, or the at least one record serving as an identifier for input data, contain respective globally unique identifiers in the user data part of the record.
13. The method of claim 1, comprising writing non-user data to one or more records that do not contain references to other records in the data structure.
14. The method of claim 2, comprising receiving input data and writing the user data to the last significant byte.
15. The method of claim 11, wherein any remaining bytes in the record are written as zeros.
16. The method of claim 1, wherein the record is 20 bytes in length and 4 bytes are used to store a reference to another record.
17. The method of claim 1, wherein the storage medium comprises a memory.
18. The method of claim 1, wherein the storage medium comprises a hard disk.
19. A computer readable medium having computer code stored thereon, wherein when the computer code is executed by a computer processor it causes the computer processor to write a plurality of records to a data structure, wherein each record has the same length in bytes, each record using a predetermined number of bytes to store a reference to a binary type, the reference indicating the binary type of data in the record, and using the remaining bytes to store data in the record, and wherein records having different lengths in bytes are not permitted in the data structure; wherein the reference to a binary type is a reference to a record that serves as an identifier for a binary type; the writing step comprising: a) writing a root record serving as an identifier for a root binary type, wherein the reference in the root record is self-referential, and points to the root record; b) writing at least one record serving as an identifier for at least one binary type of input data that is to be stored in the data structure, wherein the reference of the at least one record points to the root record; c) writing, in cases when the input data can be stored in a single record, a record to store the data, wherein the reference of the record points to a record defined in b) identifying the binary type for that record.
20. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to: d) write a record serving as an identifier for an extension binary type, wherein the reference of the record points to the root record; the extension binary type indicating that the data in the record has overflowed from the previous record; and e) write, in cases when the input data is too large to be stored in a single record, a first record to store the data, wherein the reference of the first record points to a record defined in step b) identifying the binary type for that record; and writing as many subsequent records as are necessary to store the remainder of the data, wherein the reference of the subsequent records points to the record identifying the extension binary type defined in step d).
21. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to write to the logical data structure such that the data is indiscriminate and unrestricted as to special characters.
22. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to begin writing records at a cardinal offset of the logical data structure such that records can be identified by ordinal index and positioned by means of that ordinal index.
23. The computer readable medium of claim 22, wherein the cardinal offset is zero.
24. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to write the records such that, apart from the type references used within records, no explicit record identifiers appear in the data structure.
25. The computer readable medium of claim 23, wherein the reference to another record is a number indicating the position of that record within the data structure.
26. The computer readable medium of claim 25, wherein the reference number is a positive integer.
27. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to write each record so that it comprises only the predetermined number of bytes for storing the reference to a record serving as an indication of the record's binary type, and the bytes for storing user data.
28. The computer readable medium of claim 27, wherein the reference to another record is stored in the leading bytes of the record.
29. The computer readable medium of claim 27, wherein references to records can be embedded within the user data segments of records.
30. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to write the record serving as an identifier for the root data format, or the at least one record serving as an identifier for input data, such that they contain respective globally unique identifiers in the user data part of the record.
31. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to write non-user data to one or more records that do not contain references to other records in the data structure.
32. The computer readable medium of claim 19, wherein the computer code when executed by a computer processor, further causes the computer processor to receive input data and write the input data to the last significant byte.
33. The computer readable medium of claim 32, wherein the computer code when executed by a computer processor, further causes the computer processor to write any remaining bytes in the record as zeros.
34. The computer readable medium of claim 19, wherein the record is 20 bytes in length and 4 bytes are used to store a reference to another record.
35. The computer readable medium of claim 19, wherein the computer readable medium comprises a memory or a hard disk.
36. A computer readable medium having stored thereon a data structure for storing data of multiple binary types in a single logical data structure, the data structure comprising: a plurality of records, wherein each record has the same length in bytes, each record using a predetermined number of bytes to store a reference to a binary type, the reference indicating the binary type of data in the record, and the remaining bytes to store data in the record, and wherein records having different lengths in bytes are not permitted in the data structure, wherein the reference to a binary type is a reference to a record that serves as an identifier for a binary type; and wherein the records comprise at least: a) a root record serving as an identifier for a root binary type, wherein the reference in the root record is self-referential, and points to the root record; b) at least one record serving as an identifier for at least one binary type of input data that is to be stored in the data structure, wherein the reference of the at least one record points to the root record; c) a record storing data, wherein the reference of the record points to a record defined in b) identifying the binary type for that record.
37. The computer readable medium of claim 36, wherein the records comprise: d) a record serving as an identifier for an extension binary type, wherein the reference of the record points to the root record; the extension binary type indicating that the data in the record has overflowed from the previous record; and e) at least one first record storing data, wherein the reference of the first record points to a record defined in step b) identifying the binary type for that record; and at least one subsequent record to store the remainder of the data, wherein the reference of the subsequent records points to the record identifying the extension binary type defined in step d).
38. The computer readable medium of claim 36, wherein the computer readable medium comprises a memory or a hard disk.
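By way of illustration only, the overflow arrangement recited in claims 2, 20 and 37 might be sketched in Python as follows. The sketch assumes 20-byte records with a 4-byte little-endian leading reference, places the root, extension and one user type at ordinal indices 0, 1 and 2, and uses short text labels in place of the GUIDs that would normally identify each type. These choices, together with the names `new_store`, `write_value` and `read_value`, are hypothetical. Because unused bytes are written as zeros, the reader in this sketch returns the zero-padded data; recovering an exact length is outside its scope.

```python
import struct

RECORD_SIZE, REF_SIZE = 20, 4
DATA_SIZE = RECORD_SIZE - REF_SIZE

# Ordinal indices of the type records in this sketch (an assumption).
ROOT, EXTENSION, BLOB = 0, 1, 2


def new_store() -> bytearray:
    """Create a store holding the root, extension and one user type record."""
    store = bytearray()
    # Every type record refers to the root; the root therefore refers to itself.
    # The labels are placeholders standing in for GUIDs.
    for label in (b"root", b"extension", b"blob"):
        store.extend(struct.pack("<I", ROOT) + label.ljust(DATA_SIZE, b"\x00"))
    return store


def write_value(store: bytearray, type_ref: int, data: bytes) -> int:
    """Write data as a first record plus extension records; return the first index."""
    first = len(store) // RECORD_SIZE
    chunks = [data[i:i + DATA_SIZE] for i in range(0, len(data), DATA_SIZE)] or [b""]
    for n, chunk in enumerate(chunks):
        # Only the first record carries the user type; the rest are extensions.
        ref = type_ref if n == 0 else EXTENSION
        store.extend(struct.pack("<I", ref) + chunk.ljust(DATA_SIZE, b"\x00"))
    return first


def read_value(store: bytearray, index: int) -> bytes:
    """Reassemble a value by following the consecutive extension records."""
    total = len(store) // RECORD_SIZE
    out = bytearray()
    i = index
    while True:
        start = i * RECORD_SIZE
        out += store[start + REF_SIZE:start + RECORD_SIZE]
        i += 1
        if i >= total:
            break
        (next_ref,) = struct.unpack_from("<I", store, i * RECORD_SIZE)
        if next_ref != EXTENSION:
            break
    return bytes(out)


store = new_store()
payload = bytes(range(40))                # 40 bytes: needs three 16-byte data segments
first = write_value(store, BLOB, payload)
second = write_value(store, BLOB, b"hi")  # fits in a single record
restored = read_value(store, first)
```

In this arrangement a reader needs no length field or delimiter to find where a value ends: it simply stops at the first following record whose reference is not the extension type.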